Understanding xv6: File System

1231·2024년 4월 23일

Understanding_xv6

목록 보기

6/6

The purpose of file system:

It represent the tree named directories and files to identify the each block.
(Representation)
It must be can recovery from the crash such as interrupt, failure and etc.
(Crash Recovery)
It must provide concurrency to different processes that is trying to access the same file system.
(Concurrency)
It must maintain the in-memory cache of the popular datas.
(Caching)

7 Layers of the xv6 File system:

Layers of the xv6 file system:

각 레이어 역할 한줄 정리:
Buffer Cache Layer: caches a sector in memory, ensures there is at most one buffer cache per sector.
...

Buffer Cache Layer

two jobs:
1. synchronize the access to disk to ensure that only one copy of a block is in memory.
2. caches the popular datas to in-memory

This layer exports few important interfaces: bread(), bwrite(), brelse()

bread() to obtain a "buf" containing the copy of a block from the disk that can read/modifed in memory. Returns locked buffer
bwrite() to write the "buf" to the disk block from the modifed data in memory.
brelse() to release the buffer by a kernel thread.

It uses LRU(Least Recently Used) algorithm to recycle the in-memory cache.

main.c calls binit() to creates a buffer cache as doubly linked-list.
->It initializes the list with the NBUF buffers in static array "buf".

B_VALID indicates that the buffer has valid copy of a block
B_DIRTY indicates that the buffer has been modified and have to be written to the disk block.
main.c

17 int
 18 main(void)
 19 {
 ...
 31   binit();         // buffer cache
 ...
 38 }

bio.c

29 struct {
 30   struct spinlock lock;
 31   struct buf buf[NBUF]; 
 32
 33   // Linked list of all buffers, through prev/next.
 34   // head.next is most recently used.
 35   struct buf head;
 36 } bcache;
 37
 38 void
 39 binit(void)
 40 {
 41   struct buf *b;
 42
 43   initlock(&bcache.lock, "bcache");
 44
 45 //PAGEBREAK!
 46   // Create linked list of buffers
 47   bcache.head.prev = &bcache.head;
 48   bcache.head.next = &bcache.head;
 49   for(b = bcache.buf; b < bcache.buf+NBUF; b++){
 50     b->next = bcache.head.next;
 51     b->prev = &bcache.head;
 52     initsleeplock(&b->lock, "buffer");
 53     bcache.head.next->prev = b;
 54     bcache.head.next = b;
 55   }

bread() first calls bget() to scan the buffer cache.

If the block is already cached:
bget() increment refcnt and return locked buffer.

If not:
bget() scan again to find a buffer with refcnt 0 and B_DIRTY 0 then recyling the buffer and returns locked buffer. bread() will call iderw() to read the data from the IDE disk.

In either case, bget() returns locked buffer.

bcache.lock protects information about which blocks are cached.
buffer sleep-lock protects writes and reads about a buffer.
These locks ensure that the at most one buffer cache exists per disk sector.
bio.c

61 static struct buf*
 62 bget(uint dev, uint blockno)
 63 {
 64   struct buf *b;
 65
 66   acquire(&bcache.lock);
 67
 68   // Is the block already cached?
 69   for(b = bcache.head.next; b != &bcache.head; b = b->next){
 70     if(b->dev == dev && b->blockno == blockno){
 71       b->refcnt++;
 72       release(&bcache.lock);
 73       acquiresleep(&b->lock);
 74       return b;
 75     }
 76   }

 81   for(b = bcache.head.prev; b != &bcache.head; b = b->prev){
 82     if(b->refcnt == 0 && (b->flags & B_DIRTY) == 0) {
 83       b->dev = dev;
 84       b->blockno = blockno;
 85       b->flags = 0; //it is important to clear the B_VALID field to ensure that the bread() will not read the incorrect buffer 
 86       b->refcnt = 1;
 87       release(&bcache.lock);
 88       acquiresleep(&b->lock);
 89       return b;
 90     }
 91   }
 92   panic("bget: no buffers");
 93 }
 94
 96 struct buf*
 97 bread(uint dev, uint blockno)
 98 {
 99   struct buf *b;
100
101   b = bget(dev, blockno);
102   if((b->flags & B_VALID) == 0) {
103     iderw(b);
104   }
105   return b;
106 }

bwrite() sets flag of the buffer t B_DIRTY to inform the iderw() of that this buffer should be written to the disk.
bio.c

109 void
110 bwrite(struct buf *b)
111 {
112   if(!holdingsleep(&b->lock))
113     panic("bwrite");
114   b->flags |= B_DIRTY;
115   iderw(b);
116 }

iderw() updates the locked buffer as indicated by the flag
ide.c

138 void
139 iderw(struct buf *b)
140 {
.....checking if the flag is B_DIRTY. If not, calls panic.
150   acquire(&idelock);  //DOC:acquire-lock
151
152   // Append b to idequeue.
153   b->qnext = 0;
154   for(pp=&idequeue; *pp; pp=&(*pp)->qnext)  //DOC:insert-queue
155     ;
156   *pp = b;
157
158   // Start disk if necessary.
159   if(idequeue == b)
160     idestart(b);
161
162   // Wait for request to finish.
163   while((b->flags & (B_VALID|B_DIRTY)) != B_VALID){
164     sleep(b, &idelock);
165   }
166
167
168   release(&idelock);
169 }

after the caller has done with the buffer, the caller must call brelse() to release the buffer lock.

Logging Layer

Logging exists to protect the file system from the inconsistency that comes from a crash during R/W operation.

Inode Layer

Inode is metadata about file or directory.
two types of inode:
1. on-disk structure dinode

These are packed into contiguous area of disk called inode block.
It is represented as "struct dinode".

in-memory structure inode
These are copy of a dinode and represented as "struct inode".

fs.h

28 // On-disk inode structure
 29 struct dinode {
 30   short type;           // File type file/directory/special type
 31   short major;          // Major device number (T_DEV only)
 32   short minor;          // Minor device number (T_DEV only)
 33   short nlink;          // Number of links to inode in file system, recognizes when to free this inode 
 34   uint size;            // Size of file (bytes)
 35   uint addrs[NDIRECT+1];   // Data block addresses
 36 };

Kernel keeps the copy of the struct dinode in memory" as "struct inode only if it has pointer to this inode.
Kernel discards the "struct inode" from memory when the refcnt drops to zero.
file.h

12 // in-memory copy of an inode
 13 struct inode {
 14   uint dev;           // Device number
 15   uint inum;          // Inode number
 16   int ref;            // Reference count, maintaining inode in cache while it is greater than zero 
 17   struct sleeplock lock; // protects everything below here
 18   int valid;          // inode has been read from disk?
 19
 20   short type;         // copy of disk inode
 21   short major;
 22   short minor;
 23   short nlink;
 24   uint size;
 25   uint addrs[NDIRECT+1];
 26 };

iget(): acquire the pointer to inode, providing non-exclusive access to the same inode.
iput(): drop the pointer to inode
ilock() to acquire the lock for cached inode, iunlock() to release.

Separating acquisition of inode pointers from locking helps avoid deadlock in some situations.

The icache.lock spin-lock protects the allocation of icache entries.
Since ip->ref indicates whether an entry is free, and ip->dev and ip->inum indicate which i-node an entry holds, one must hold icache.lock while using any of those fields.

An ip->lock sleep-lock protects all ip-> fields other than ref, dev, and inum. One must hold ip->lock in order to read or write that inode's ip->valid, ip->size, ip->type.

To allocate a new inode, xv6 calls ialloc.
Iallocs first gets buffer that is copy of block containing the inode.
If the dinode is free inode, ialloc() sets dip->type to type indicating this inode is allocated.
Then, ialloc() calls iget() to get in-memory inode pointer.
fs.h

194 struct inode*
195 ialloc(uint dev, short type)
196 {
197   int inum;
198   struct buf *bp;
199   struct dinode *dip;
200
201   for(inum = 1; inum < sb.ninodes; inum++){
202     bp = bread(dev, IBLOCK(inum, sb));
203     dip = (struct dinode*)bp->data + inum%IPB;
204     if(dip->type == 0){  // a free inode
205       memset(dip, 0, sizeof(*dip));
206       dip->type = type;
207       log_write(bp);   // mark it allocated on the disk
208       brelse(bp);
209       return iget(dev, inum);
210     }
211     brelse(bp);
212   }
213   panic("ialloc: no inodes");
214 }

iget() scans through the in-memory inode cache to check if the inode already cached.
If it is, incrementing the refcnt and return that inode pointer.

As iget() scans, it records the position of the first empty slot which will be recycled as inode cache entry.

241 static struct inode*
242 iget(uint dev, uint inum)
243 {
244   struct inode *ip, *empty;
245
246   acquire(&icache.lock);
247
248   // Is the inode already cached?
249   empty = 0;
250   for(ip = &icache.inode[0]; ip < &icache.inode[NINODE]; ip++){
251     if(ip->ref > 0 && ip->dev == dev && ip->inum == inum){
252       ip->ref++;
253       release(&icache.lock);
254       return ip;
255     }
256     if(empty == 0 && ip->ref == 0)    // Remember empty slot.
257       empty = ip;
258   }
259
260   // Recycle an inode cache entry.
261   if(empty == 0)
262     panic("iget: no inodes");
263
264   ip = empty;
265   ip->dev = dev;
266   ip->inum = inum;
267   ip->ref = 1;
268   ip->valid = 0;
269   release(&icache.lock);
270
271   return ip;
272 }

iput() decrements the refcnf of the inode.
If it is last reference, it calls itrunc() to trucate the file to zero bytes and frees the data block.
setting ip->type 0 is important because it indicates it is freed inode.

331 void
332 iput(struct inode *ip)
333 {
334   acquiresleep(&ip->lock);
335   if(ip->valid && ip->nlink == 0){
336     acquire(&icache.lock);
337     int r = ip->ref;
338     release(&icache.lock);
339     if(r == 1){
340       // inode has no links and no other references: truncate and free.
341       itrunc(ip);
342       ip->type = 0; //indicating now that it is freed inode
343       iupdate(ip); //write the update to disk 
344       ip->valid = 0;
345     }
346   }
347   releasesleep(&ip->lock);
348
349   acquire(&icache.lock);
350   ip->ref--;
351   release(&icache.lock);
352 }

challenging interaction between iput() and crashes:
after crash, file may be marked allocated on disk but no directory entry points to it.
Solution 1: scan file system to find the files that is allocated but with zero reference then frees it.

Solution 2: file system records on disk (superblock) the inode inumber of a file whose link count drops(directory link) to zero whose reference count (in-memory inode reference) isn't zero. (This mean there's no directory entry pointing to this inode but isn't freed yet.)

xv6 does not implements any solution for this recovery, thus in case of crash during iput() will cause kernel to run out of space.

Inode contents can be found in the blocks listed in the "struct dinode".
The first NDIRECT blocks are listed in the NDIRECT entries of the array(direct block) but other blocks are listed in the indirect entry of the array(indirect block).

first 6 KB -> by direct block
next 64 KB -> by indirect block

29 struct dinode {
 ...
 34   uint size;            // Size of file (bytes)
 35   uint addrs[NDIRECT+1];   // Data block addresses
 36 };

The function bmap() returns disk block number.
bmap() first checks if the requested offset is outrange, then if offset is of indirect block, it will try to get block number by referencing the indirect block.
If there's no allocated block for data addr, bmap() allocates space for the data.
bmap() makes it easy for readi() and writei() to get an inode's data.

372 static uint
373 bmap(struct inode *ip, uint bn)
374 {
375   uint addr, *a;
376   struct buf *bp;
377
378   if(bn < NDIRECT){
379     if((addr = ip->addrs[bn]) == 0)
380       ip->addrs[bn] = addr = balloc(ip->dev);
381     return addr;
382   }
383   bn -= NDIRECT;
384
385   if(bn < NINDIRECT){
386     // Load indirect block, allocating if necessary.
387     if((addr = ip->addrs[NDIRECT]) == 0)
388       ip->addrs[NDIRECT] = addr = balloc(ip->dev);
389     bp = bread(ip->dev, addr); //indirect block referencing 
390     a = (uint*)bp->data; 
391     if((addr = a[bn]) == 0){ //get bn'th block from the array
392       a[bn] = addr = balloc(ip->dev);
393       log_write(bp);
394     }
395     brelse(bp);
396     return addr;
397   }
398
399   panic("bmap: out of range");
400 }

readi() moves requested block to location dst.
It checks if the inode that it tries to access is special type T_DEV, and offset is out of range

write() moves src location as much as size n to requested block.

Directory Layer

...

File Descriptor Layer

... see sysfile.c

1231

이전 포스트

Understanding xv6: File System