python set은 어떻게 동작하는 걸까?

Matthew Woo·2022년 1월 5일

Data structure

목록 보기

2/3

프로그래머스에서 "숫자의 표현" 이라는 문제를 풀었다.
다른 사람의 풀이와 비교해봐도 'set' 을 이용해서 나름 잘 푼거같은데..
그래서 set은 어떻게 동작하는거지? 어떻게 시간복잡도가 O(1)이 나와서 효율성이 이렇게 좋게 뜨는지 문득 생각이 들었다.

사실 알고리즘 푼 경력(?) 이 상당히 짧지만 set()을 쓰면 시간 복잡도가 O(1)이라길래 몇번 쓴 적이 있다만
쓸 때마다 속성들을 찾아보면서 이게 왜 이런거지라는 생각을 몇번 하고 넘어갔다가 이번에 또 set을 쓰면서 이번에는 알아봐야겠다라는 생각에 찾아보게 되었다.

결론 부터 말하자면 python set()이 항상 시간 복잡도 O(1)을 보장하는 것이 아니며, 메모리를 add된 key 보다 약 두배 가까이 할당해놓고 있으며,
구현은 역시나.. hash table이었다.****

시간복잡도 는 이곳 파이썬 공식문서에 나와있고 해당 문서에서 python set이 구현되어 있는 c언어 코드를 같이 제공한다.

geeksforgeeks를 보면 그림으로 잘 보여주고 잇는데, 인덱스가 다를 경우 O(1)에 접근하지만 인덱스 충돌이 일어날 경우 내부는 Linked list 로 구현되어 있어서 worst의 경우에는 O(n)이 발생할 수 있다. 허나 이러한 경우가 거의 발생하지 않는 이유는,

hashtable에 set.add()를 해주는 상황에서 hashtable 를 resize 해주는 작업을 하는데,
이 resize 작업이라는게 크기를 키운 새로운 테이블을 할당하고 기존의 키들을 다시 새로 복사하여 넣는다.
헐.. 뒷단에서 이런 일들을 해주고 있었을 줄이야..

/*
Restructure the table by allocating a new table and reinserting all
keys again.  When entries have been deleted, the new table may
actually be smaller than the old one.
*/
static int
set_table_resize(PySetObject *so, Py_ssize_t minused)
{
    setentry *oldtable, *newtable, *entry;
    Py_ssize_t oldmask = so->mask;
    size_t newmask;
    int is_oldtable_malloced;
    setentry small_copy[PySet_MINSIZE];

    assert(minused >= 0);

    /* Find the smallest table size > minused. */
    /* XXX speed-up with intrinsics */
    size_t newsize = PySet_MINSIZE;
    while (newsize <= (size_t)minused) {
        newsize <<= 1; // The largest possible value is PY_SSIZE_T_MAX + 1.
    }

    /* Get space for a new table. */
    oldtable = so->table;
    assert(oldtable != NULL);
    is_oldtable_malloced = oldtable != so->smalltable;

    if (newsize == PySet_MINSIZE) {
        /* A large table is shrinking, or we can't get any smaller. */
        newtable = so->smalltable;
        if (newtable == oldtable) {
            if (so->fill == so->used) {
                /* No dummies, so no point doing anything. */
                return 0;
            }
            /* We're not going to resize it, but rebuild the
               table anyway to purge old dummy entries.
               Subtle:  This is *necessary* if fill==size,
               as set_lookkey needs at least one virgin slot to
               terminate failing searches.  If fill < size, it's
               merely desirable, as dummies slow searches. */
            assert(so->fill > so->used);
            memcpy(small_copy, oldtable, sizeof(small_copy));
            oldtable = small_copy;
        }
    }
    else {
        newtable = PyMem_NEW(setentry, newsize);
        if (newtable == NULL) {
            PyErr_NoMemory();
            return -1;
        }
    }

    /* Make the set empty, using the new table. */
    assert(newtable != oldtable);
    memset(newtable, 0, sizeof(setentry) * newsize);
    so->mask = newsize - 1;
    so->table = newtable;

    /* Copy the data over; this is refcount-neutral for active entries;
       dummy entries aren't copied over, of course */
    newmask = (size_t)so->mask;
    if (so->fill == so->used) {
        for (entry = oldtable; entry <= oldtable + oldmask; entry++) {
            if (entry->key != NULL) {
                set_insert_clean(newtable, newmask, entry->key, entry->hash);
            }
        }
    } else {
        so->fill = so->used;
        for (entry = oldtable; entry <= oldtable + oldmask; entry++) {
            if (entry->key != NULL && entry->key != dummy) {
                set_insert_clean(newtable, newmask, entry->key, entry->hash);
            }
        }
    }

    if (is_oldtable_malloced)
        PyMem_Free(oldtable);
    return 0;
}

그럼 가장 위의 내가 풀었던 알고리즘 풀이를 공개해보겠다.

def solution(n):
    memo = set([0])
    answer = 0
    
    for i in range(1, n+1):
        sum = i * (i + 1) // 2 # 1 ~ n번까지 더한 값
        memo.add(sum) # 그 값을 메모해둠
        
        need = sum - n # 1~ n 번까지 더한 값(sum)에 1 ~ k번까지 더 한 값(need)이 n이 되어야함
        if need in memo:
            answer += 1
    
    return answer

문제 예시에서는 n 사이즈를 15로 주었고 문제 제출을 하면 약 20개 이상의 테스트가 돌아갔는데,
과연 내가 set.add()를 하면서 resize 작업은 몇 번 일어났을까?

import sys

def solution(n):
    memo = set([0])
    answer = 0
    initial_size = sys.getsizeof(memo) # 초기 사이즈
    count = 0 # 몇 번 resize 되는지
    
    for i in range(1, n+1):
        sum = i * (i + 1) // 2 
        memo.add(sum) 
        new_size = sys.getsizeof(memo)
        
        if initial_size != new_size:
            print(i, initial_size, new_size) # (몇 번째 삽입할때, 기존  사이즈, resize 후 사이즈)
            count += 1 # resize 횟 수
            initial_size = new_size
        
        need = sum - n 
        if need in memo:
            answer += 1
    
    print(count)
    return answer

주석달아놓은 부분의 코드를 조금 더 넣어보았다.

15개의 범위에서는 한 번 resize가 발생한다.

n = 100일 때,
4 216 728
18 728 2264
76 2264 8408
3

n = 1000일 때,
4 216 728
18 728 2264
76 2264 8408
306 8408 32984
4

n = 100,000 일 때,
4 216 728
18 728 2264
76 2264 8408
306 8408 32984
1228 32984 131288
4914 131288 524504
19660 524504 2097368
78642 2097368 4194520
8

n = 1,000,000 일 때
4 216 728
18 728 2264
76 2264 8408
306 8408 32984
1228 32984 131288
4914 131288 524504
19660 524504 2097368
78642 2097368 4194520
157285 4194520 8388824
314572 8388824 16777432
629144 16777432 33554648
11

재밌는 점은 resize하면서 일정 범위까지는 size를 4배를 늘리고, 특정 이상 범위가 넘어가면 2배를 늘리게 되어있다.
전체 코드가 있는 부분에서 set_add_entry 부분이 있는데, 해당 부분을 보면

  found_unused:
    so->fill++;
    so->used++;
    entry->key = key;
    entry->hash = hash;
    if ((size_t)so->fill*5 < mask*3)
        return 0;
    return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);

resize해주면서 사용하고 있는 메모리 바이트가 50000이상이 되면 두배씩, 그 전까지는 4배로 사용한다.
초기 선언을 하면 bucket 사이즈는 8이며, 4배씩 늘다가 이후 2배씩 늘려준다. 4/8, 18/32, 76/128 의 0.6 미만의 한도로 bucket이 추가될 때 resize 작업이 수행된다.

이러한 과정을 통해서 set()이라는 함수는
특정 entry가 있는지 없는지 O(1)을 보장하기 위해 해당 존재여부를 확인해줄 때는 O(1)이지만 add로 늘려주는 상황에서는 resize하는 부분에서 시간을 소요하게 되고 메모리도 list()에 비해 꽤나 많은 용량을 사용하게 된다.

그럼 또 궁금한점.. remove나 discard 하면서도 resize가 되고 메모리 용량을 조절해서 줄여줄까?
글 작성하다 마무리하려다 코드를 돌려봐야겠다.

del_resize_cnt = 0
    for j in range(n, 0, -1):
        memo.pop()
        deleted_size = sys.getsizeof(memo)

        if initial_size != deleted_size:
            print(j, initial_size, deleted_size) # (몇 번째 삽입할때, 기존  사이즈, resize 후 사이즈)
            del_resize_cnt += 1 # resize 횟 수
            initial_size = deleted_size

    print(del_resize_cnt)
    print(deleted_size)
    return answer
    
    # result
4 216 728
18 728 2264
76 2264 8408
306 8408 32984
1228 32984 131288
4914 131288 524504
19660 524504 2097368
78642 2097368 4194520
157285 4194520 8388824
314572 8388824 16777432
629144 16777432 33554648
0 # del_resize_cnt resize 되지 않았음!!
33554648 # deleted_size # 사이즈가 그대로 유지되어있음

돌려본 결과 지워줄 때는 pop을 해도 resize로 메모리 사용량을 줄이거나 resize작업을 해주지 않는다.
한번 키워놓으면 돌이킬 수 없음!

geeksforgeeks
시간복잡도
c언어 코드

Matthew Woo

지속가능하고 안정적인 시스템을 만들고자 합니다.

이전 포스트

선밸런싱 이후 삽입, 삭제하는 레드블랙트리

다음 포스트

python set은 어떻게 동작하는 걸까?

Data structure

선밸런싱 이후 삽입, 삭제하는 레드블랙트리

Trie Datastructure

0개의 댓글