Mogakco: lmbench memory bandwidth

김필모 · March 8, 2024

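This post is my translation and summary of lmbench: Portable tools for performance analysis. Remarks I wanted to attach to a term are in parentheses (), and anything that needed a longer explanation is in the Additional notes sections.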

5.1 Memory bandwidth

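Data movement (read, write, copy) is fundamental to any OS. In the past, performance was often measured in MFLOPS (millions of floating-point operations a computer can perform per second). Back then the FPU was slow enough that microprocessor systems were rarely limited by memory bandwidth. Today, though, the FPU is usually much faster than memory bandwidth, so many current MFLOPS ratings cannot be sustained on data that lives in memory. They are "cache only" ratings.

We'll measure copy, read, and write over a range of sizes. This write-up focuses on large memory transfers.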

There are two ways to measure copy bandwidth.

The first is to use the user-level library bcopy interface.

The second is a hand-unrolled loop(1) that loads and stores aligned 8-byte words.

One thing to watch out for: with either method, if any cache is direct-mapped, the source and destination must not map to the same cache lines. (Otherwise you get cache collisions.)
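To see why two buffers can collide, here is a minimal sketch of how a direct-mapped cache picks a line for an address. The helper names `cache_line_index` and `same_cache_line`, and the cache and line sizes used in the note below, are assumptions for illustration and do not come from the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index of the cache line that an address maps to
 * in a direct-mapped cache. Each address has exactly one possible line. */
size_t cache_line_index(uintptr_t addr, size_t cache_size, size_t line_size)
{
    assert(line_size > 0 && cache_size >= line_size);
    size_t num_lines = cache_size / line_size;
    return (size_t)((addr / line_size) % num_lines);
}

/* Returns 1 if the two addresses compete for the same cache line. */
int same_cache_line(uintptr_t src, uintptr_t dst,
                    size_t cache_size, size_t line_size)
{
    return cache_line_index(src, cache_size, line_size) ==
           cache_line_index(dst, cache_size, line_size);
}
```

With an assumed 1 MB cache and 32-byte lines, a source at address 0 and a destination exactly 8 MB later land on the same line, so every store would evict the line that was just loaded. Shifting the destination by 128 bytes moves it to a different line.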

To measure memory bandwidth rather than cache bandwidth, the benchmark copies an 8M area to another 8M area. (Once secondary caches reach 16M, the sizes will of course have to be increased. Otherwise caching effects creep in.)
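Here is a rough sketch of what an 8M to 8M copy measurement could look like. It is not lmbench's code: `memcpy` stands in for `bcopy`, `clock()` stands in for lmbench's own timing harness, and the name `copy_bandwidth_mb_per_sec` is made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COPY_BYTES (8u * 1024 * 1024)  /* 8M area, as in the text */

/* Hypothetical helper: copies `bytes` from one buffer to another `reps`
 * times and returns MB/s of copied data, or -1.0 on allocation failure.
 * memcpy stands in for bcopy; clock() stands in for lmbench's timers. */
double copy_bandwidth_mb_per_sec(size_t bytes, int reps)
{
    assert(bytes > 0 && reps > 0);
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 1, bytes);  /* touch the pages before timing starts */
    memset(dst, 0, bytes);

    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        memcpy(dst, src, bytes);
    }
    clock_t end = clock();

    /* Read the destination so the copied data is actually used. */
    int ok = (dst[0] == 1 && dst[bytes - 1] == 1);

    double secs = (double)(end - start) / CLOCKS_PER_SEC;
    free(src);
    free(dst);
    if (!ok || secs <= 0.0) {
        return 0.0;  /* copy check failed or the run was too fast to time */
    }
    return ((double)bytes * reps) / (1024.0 * 1024.0) / secs;
}
```

Calling `copy_bandwidth_mb_per_sec(COPY_BYTES, 100)` gives a ballpark figure. An aggressive optimizer could in principle merge the repeated identical copies, so treat the number as an illustration rather than a careful measurement.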

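The copy results represent about one-half to one-third of the memory bandwidth actually used, because a copy both reads and writes memory. If the cache line is larger than the word being stored, the destination cache line is typically read before it is written. The actual amount of memory bandwidth used varies by architecture, since some architectures have special instructions designed specifically for bcopy.

With optimization turned on, almost every C compiler would optimize away a loop that only reads memory. With optimization turned off, it would generate far too many instructions. The fix is to sum up the data and pass the result to the "finish timing" function.(2)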
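The one-half to one-third relationship is just a count of how many times each byte crosses the memory bus. A small sketch of that arithmetic, using a hypothetical helper `memory_traffic_mb` and made-up numbers:

```c
#include <assert.h>

/* Hypothetical helper: estimated memory traffic (MB/s) behind a reported
 * copy rate. Every copied byte is read from the source and written to the
 * destination (2 transfers). If the destination cache line is read before
 * it is written, that adds a third transfer. */
long memory_traffic_mb(long copy_mb_per_sec, int dst_line_read_first)
{
    assert(copy_mb_per_sec >= 0);
    int transfers_per_byte = dst_line_read_first ? 3 : 2;
    return copy_mb_per_sec * transfers_per_byte;
}
```

So a reported copy rate of 100 MB/s would correspond to roughly 200 MB/s of memory traffic, or 300 MB/s when each destination line is read first.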

[Image: Screenshot 2024-01-11 at 5.03.08 PM.png]

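A memory read is about one-third to one-half of the work bcopy does (there is no write overhead), so we expect pure reads to run at roughly twice the speed of bcopy. Exceptions point to a bug in the benchmark, a problem in bcopy, or some unusual hardware.

Memory write is also measured with an unrolled loop. The processor cost of each memory operation is roughly the same for writes as for reads.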
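As a rough picture of such a write loop, here is a sketch that assumes the benchmark simply stores a constant into consecutive ints. The function name `write_unrolled` and the unroll factor of four are my own choices for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical write loop: stores `value` into every int of `buf`,
 * four stores per iteration. `n` is the number of ints. */
void write_unrolled(int *buf, size_t n, int value)
{
    assert(buf != NULL || n == 0);
    size_t i = 0;
    /* Unrolled body: four stores per loop-condition check. */
    for (; i + 4 <= n; i += 4) {
        buf[i]     = value;
        buf[i + 1] = value;
        buf[i + 2] = value;
        buf[i + 3] = value;
    }
    /* Remainder: the last n % 4 elements. */
    for (; i < n; i++) {
        buf[i] = value;
    }
}
```

Timing a call like `write_unrolled(buf, n, 1)` over a buffer much larger than the cache, then dividing the bytes written by the elapsed time, would give a write bandwidth estimate.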

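(1) Additional notes

Loop unrolling is one technique for optimizing loops, and "hand-unrolled" means the programmer did it manually instead of leaving it to the compiler. Here's an example.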

for (int i = 0; i < n; i++) {
    process(array[i]);
}

Unrolling means rewriting a basic loop like the one above into this:

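int i = 0;
// Main unrolled body: four elements per iteration.
for (; i + 3 < n; i += 4) {
    process(array[i]);
    process(array[i + 1]);
    process(array[i + 2]);
    process(array[i + 3]);
}
// Remainder loop: handles the last n % 4 elements.
for (; i < n; i++) {
    process(array[i]);
}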

Compared with the first version, it checks the loop condition less often and increments 'i' less often.

Since the paper describes a hand-unrolled loop that loads and stores aligned 8-byte words, you can picture something like this.

// Assume 'array' and 'processed_array' are arrays of an 8-byte data type
// (like 'long long' in C), are 8-byte aligned in memory,
// and hold n elements each.
int i = 0;
for (; i + 1 < n; i += 2) {
    // Load two 8-byte words from the source array into registers
    long long word1 = array[i];
    long long word2 = array[i + 1];

    // Any processing on word1 and word2 would go here (a plain copy does none)

    // Store the words into the destination array
    processed_array[i] = word1;
    processed_array[i + 1] = word2;
}
// Copy the leftover element when n is odd
if (i < n) {
    processed_array[i] = array[i];
}

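(2) Additional notes

We want to benchmark memory operations, but if the loop produces nothing that is ever used, the compiler may decide it is useless (as far as the program is concerned) and delete it. So we add code that sums up the data and hand the sum to the finish timing function. It is our way of telling the compiler: we are doing real work here, please don't remove it.

For example: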

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Function to finish timing and use the sum to prevent optimization
void finish_timing(unsigned long long sum) {
    printf("Sum: %llu\n", sum);
}

int main(void) {
    const int SIZE = 1000000; // Size of the array
    int *array = malloc(SIZE * sizeof *array);
    unsigned long long sum = 0;
    clock_t start, end;
    double cpu_time_used;

    // Bail out if the allocation failed
    if (array == NULL) {
        perror("malloc");
        return 1;
    }

    // Initialize the array with some values
    for (int i = 0; i < SIZE; i++) {
        array[i] = i;
    }

    // Start timing
    start = clock();

    // Loop through the array and sum up the values
    for (int i = 0; i < SIZE; i++) {
        sum += array[i];
    }

    // End timing
    end = clock();

    // Calculate the time taken
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

    // Use the sum to prevent the loop from being optimized out
    finish_timing(sum);

    printf("Time taken: %.6f seconds\n", cpu_time_used);

    free(array);
    return 0;
}

Thoughts

Studying together with my teammates really fires up my motivation to keep learning!!!
Study time is so much fun!
