3. File Format

anna ny·2023년 3월 21일

Database Internals

목록 보기
2/4
post-thumbnail

Memory vs Disk

Memory Access

  • Almost trasparent at the application level
    • with the use of virtual memory
    • no need to manage the offset

Disk Access

  • Through a system call (Linux kernel)
    • requires indication of the offset directly
    • requires conversion of the disk format to a memory-readable format

File Format Design

File Format

  • Similar to an unmanaged-memory-model programming language
  • Assigns a fixed-size data block
  • Uses a pointer for the reference of
    • memory chunk
    • variable length data structure
  • Doesn't care about
    • continuous memory segment
    • memory fragmentation
    • deallocation of memory
  • Should care about (disk)
    • garbage collection
    • fragmentation

Binary Encoding

  • Should encode data as a compact and serializable(binary) format
    • to save them efficiently on disk
    • In other words, to design an efficient page layout
  • Record
    • consists of a combination of data types like int, char, bool
    • transferred as a byte sequence format
    • through network and stored on disk
  • Variable Length data
    • uses Pascal-style string (null terminated)
    • consists of size and data
      String {
      	size uint_16
        data byte[size]
      }```
  • Bit Data
    • Boolean
      • uses 1 bit each
      • 8 booleans in 1 byte (grouping)
    • Enum
      • commonly used in binary format and communication protocol
      enum NodeType { 
        ROOT, // 0x00h 
        INTERNAL, // 0x01h 
        LEAF // 0x02h
      };
    • Flag
      - (grouping) boolean + enum
      - uses OR operator or bitmask(or shift op.)
      // Set a bit
      flags |= HAS_OVERFLOW_PAGES; flags |= (1 « 2);
      	
      // Remove a bit
      flags &= -HAS_OVERFLOW_PAGES;
      flags &= ~(1 << 2);
      	 
      // Check a bit
      is_set = (flags & HAS_OVERFLOW_PAGES) != 0;
      is_set = (flags & (1 << 2)) != 0;

Principle of File Format Design

  • Addressing

    • decide whether divide a file into a block or several blocks
    • In-place update mostly uses fixed-page size
    • with the ease of read and write
  • File Structure

    • header + pages + trailer
    • Most data stores have a pre-defined schema
    • no need to store field name reptitively


Page Structure

B-tree Node

  • consists of one page or several pages
  • has the same meaning as a page(block)

Fixed-length Record

  • has a triplet structure(k-key, v-value, p-pointer)
  • not proper for variable-length records
  • inefficient once returning deleted record pages

Slotted Page

  • Satisfies the followings:
    • stores variable-length record with the minimum overhead
    • returns deleted record pages efficiently
    • refers to records in pages regardless of the exact location
  • Consists of three components:
    • Header
      • contains important info of the page and cells
      • including DB version (in a way that every version can read the version value in the file)
      • checksum (to detect the damage of data)
    • Pointers
      • points to each cell
    • Cells
      • Key cell
      • Key-value cell
  • Manages record deletion by
    • deleting the cell of record
    • updating the memory size and pointer to the availability list
    • should find available segments using Best Fit or First Fit strategy

0개의 댓글