File Systems

Learning Materials

Videos:

Disks

Introduction to file systems

File system naming

FFS

File system recovery

Journaling file systems

LFS

Lecture Slides:

File systems

Readings:

[OSTEP] Hard disk drives

[OSTEP] Flash-based SSDs

[OSTEP] File system implementation

[OSTEP] Locality and the FFS

[OSTEP] Crash consistency: fsck and journaling

[OSTEP] Log-structured file systems

ctFS paper at FAST'22 conference

Hard Disk Drives

Disk Basics

A complete I/O:

First a seek, then a wait for the rotational delay, and finally the transfer (read/write).

$T_{I/O} = T_{seek} + T_{rotation} + T_{transfer}$

The rate of I/O: $R_{I/O} = \frac{Size_{Transfer}}{T_{I/O}}$
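
As a quick worked example (with illustrative numbers, not measurements of any particular drive): assume an average seek of 7 ms, a 7,200 RPM spindle, and a 100 MB/s transfer rate for a 4 KB request. A full rotation takes $60/7200 \approx 8.3$ ms, so the average rotational delay is about half of that, 4.2 ms; the transfer itself takes only $4\,\mathrm{KB} / 100\,\mathrm{MB/s} \approx 0.04$ ms. Then $T_{I/O} \approx 7 + 4.2 + 0.04 \approx 11.2$ ms and $R_{I/O} \approx 4\,\mathrm{KB} / 11.2\,\mathrm{ms} \approx 0.35$ MB/s, which is why random I/O on hard drives is so much slower than sequential transfer.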

Disk Scheduling

  1. SSTF: Shortest Seek Time First

How can we implement SSTF-like scheduling but avoid starvation?

  2. Elevator (a.k.a. SCAN or C-SCAN)

How can we implement an algorithm that more closely approximates SJF by taking both seek and rotation into account?

  3. SPTF: Shortest Positioning Time First
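
To make the trade-off concrete, here is a minimal C sketch of an SSTF-style picker (the `struct request` type and track-based cost model are simplifications I'm assuming; real schedulers work on sector addresses):

```c
#include <stdlib.h>

// A pending disk request, identified by its target track.
struct request {
    int track;
};

// SSTF sketch: scan the pending queue and pick the request whose track
// is closest to the current head position.
int sstf_pick(const struct request *queue, int n, int head_pos) {
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (abs(queue[i].track - head_pos) < abs(queue[best].track - head_pos))
            best = i;
    }
    return best;  // index of the request to service next
}
```

Because the loop always favors the nearest track, a steady stream of nearby requests can postpone a distant one indefinitely; that starvation is exactly what the elevator algorithms address.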

Linux I/O Schedulers (*)

Goals of I/O schedulers

  • Minimize disk seeks
  • Ensure optimum disk performance
  • Provide fairness among I/O requests

Balancing these goals or trading them against one another is the essence of the art of creating an I/O scheduler.

Four schedulers in Linux

  • NOOP

All incoming I/O requests from all processes on the system, regardless of the request type (read, write, etc.), go into a simple first-in, first-out (FIFO) queue. The scheduler also merges adjacent requests to reduce the number of operations and improve throughput.

Beyond merging, NOOP makes no attempt to reduce seek time by reordering requests. Therefore, storage devices that have very little seek time benefit from a NOOP I/O scheduler.

  • Anticipatory

It anticipates subsequent block requests and implements request merging, a one-way elevator, and read and write request batching.

After the scheduler services an I/O request, it pauses slightly to see if the next request is for the subsequent block.

A web server can achieve better performance with anticipatory I/O scheduling, while a database run can suffer a slowdown.

  • Deadline

The deadline scheduler maintains two deadline queues (one for reads, one for writes) in addition to two sector-sorted queues for reads and writes.

The sorted queues order requests by sector number (the elevator approach).

The read queues are prioritized. If the first request in a deadline queue has expired, the scheduler executes it immediately; otherwise, it serves a batch of requests from the sorted queue (see the sketch after this list).

Useful for real-time applications as it keeps latency low, and for database systems that have TCQ-aware disks (the hardware makes I/O scheduling decisions).

  • CFQ

Default Linux kernel scheduler

CFQ (Completely Fair Queuing) places synchronous requests from processes into a number of per-process queues and then allocates time slices for each of the queues to access the disk.
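
Here is a minimal sketch of the deadline dispatch decision for one direction (reads or writes). The `struct request` layout and the `pick_next` helper are hypothetical simplifications, not the kernel's actual data structures:

```c
#include <time.h>

// Hypothetical request type: the sorted queue orders requests by sector
// (elevator order); the deadline queue orders them by expiry time (FIFO,
// since every request is assigned a fixed deadline on arrival).
struct request {
    long sector;
    time_t deadline;
};

// Dispatch sketch: serve the deadline queue only when its head request
// has expired; otherwise keep streaming batches in elevator order.
struct request *pick_next(struct request *deadline_head,
                          struct request *sorted_head,
                          time_t now) {
    if (deadline_head != NULL && deadline_head->deadline <= now)
        return deadline_head;  // expired: service immediately
    return sorted_head;        // otherwise, batch from the sorted queue
}
```

The real scheduler also batches requests and alternates between the read and write directions; this shows only the expiry check that bounds latency.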

Flash-based SSDs

Flash Basics

A flash page can only be programmed (written) after the block containing it has been erased; thus, whenever we want to overwrite a page, we need to erase the whole block first.

Primary concern of flash chips: wear out!

What is wear out?

When a flash block is erased and programmed, it slowly accrues a little bit of extra charge. Over time, as that extra charge builds up, it becomes increasingly difficult to differentiate between a 0 and a 1. At the point where it becomes impossible, the block becomes unusable.

From Raw Flash to Flash-Based SSDs

FTL (flash translation layer): the FTL turns the client's read and write requests into internal flash operations (low-level read, erase, and program).

FTL Organization: A Log-Structured FTL

How do we translate a logical block address (issued by the client) into a physical page address (on the actual SSD device)?

The FTL maintains an in-memory mapping table.

What happens when the client (the file system) writes a page to the SSD?

When the client writes to the SSD, the device picks the next free page, programs it with the block's contents, and records the logical-to-physical mapping in its mapping table.
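
A minimal sketch of such a log-structured write path, assuming a simple page-mapped FTL (the names `ftl_write` and `flash_program` are illustrative, not a real device API):

```c
#include <stdint.h>

#define NUM_LOGICAL_BLOCKS 1024

// Hypothetical page-mapped FTL state: one table entry per logical block.
static uint32_t mapping[NUM_LOGICAL_BLOCKS]; // logical block -> physical page
static uint32_t next_free_page = 0;          // head of the log

// Stub standing in for the device's low-level program operation.
static void flash_program(uint32_t phys_page, const void *data) {
    (void)phys_page; (void)data;  // real hardware would program the page here
}

// Log-structured write: append the data to the next free page and record
// where this logical block now lives. The page that previously held this
// block (if any) becomes garbage, to be reclaimed later.
void ftl_write(uint32_t logical_block, const void *data) {
    uint32_t page = next_free_page++;
    flash_program(page, data);
    mapping[logical_block] = page;
}
```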

Three major problems with log-structured FTL

  • Cost of Garbage Collection

What happens when we overwrite a page?

Overwritten pages contain garbage (old versions of data). We need garbage collection!

Garbage collection: find a block that contains one or more garbage pages, read in the live (non-garbage) pages from that block, write those live pages out to the log, and finally reclaim the entire block for use in writing (see the sketch at the end of this list).

  • Large size of the mapping table

What if we have a very large mapping table?

  1. Hybrid mapping (log-based + block-based mapping): Maintain a log table (a pointer per page) and a data table (a pointer per block)

  2. Page mapping plus caching: Cache only the hot pieces of the FTL in memory

  • Wear Leveling

How to balance the erase/program loads among physical blocks of the flash chip so that all blocks wear out roughly at the same time?

The FTL occasionally migrates data from blocks that are mostly read (seldom overwritten) in order to ensure that those blocks also receive their share of the erase/program load.
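
A sketch of the cleaning loop referenced above, assuming hypothetical helpers (`page_is_live`, `read_page`, `append_to_log`, `erase_block`) provided by the FTL:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGES_PER_BLOCK 64

// Hypothetical FTL helpers; these names are illustrative, not a real API.
bool page_is_live(uint32_t block, uint32_t page);        // mapping still points here?
void read_page(uint32_t block, uint32_t page, void *buf);
void append_to_log(const void *buf);                     // ftl_write-style append
void erase_block(uint32_t block);

// Garbage collection for one block: copy the live pages out to the head
// of the log, then erase the block so it can be reused for new writes.
void gc_block(uint32_t block) {
    char buf[4096];  // assume 4 KB pages
    for (uint32_t p = 0; p < PAGES_PER_BLOCK; p++) {
        if (page_is_live(block, p)) {
            read_page(block, p, buf);
            append_to_log(buf);  // the mapping table is updated by the append
        }
    }
    erase_block(block);  // the whole block is now free
}
```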

File System Implementation

Introduction to file systems

The vnode is the internal structure that describes a file.

The key part of defining a file system is to design the data structure that maps from an abstract notion of a file to a collection of disk blocks.

File structure:

Inode Structure

The Inode Table

The Multi-Level Index File System:

The inode contains some direct pointers that point to data blocks, and also some indirect pointers that point to indirect blocks, whose pointers in turn point to data blocks.
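
A simplified C rendering of such an inode (the field names and constants are illustrative; real on-disk inodes also carry permissions, timestamps, and so on):

```c
#include <stdint.h>

#define NUM_DIRECT 12  // e.g., ext2 uses 12 direct pointers

// A simplified multi-level-index inode.
struct inode {
    uint32_t size;                // file size in bytes
    uint32_t direct[NUM_DIRECT];  // block numbers of the first data blocks
    uint32_t single_indirect;     // block full of data-block pointers
    uint32_t double_indirect;     // block full of indirect-block pointers
};
```

With 4 KB blocks and 4-byte block pointers, one single indirect block adds 1024 data-block pointers, and a double indirect block adds another 1024 × 1024, so the maximum file size grows rapidly with each level of indirection.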

Directory

A directory is just a special type of file that stores name→inode-number mappings. Thus, a directory has an inode somewhere in the inode table.

The root directory has a designated inode number (2).
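
A simplified, fixed-size directory entry might look like this (illustrative; ext2's real entries are variable-length records with a record length and a name length field):

```c
#include <stdint.h>

// One name -> inode-number mapping within a directory's data blocks.
struct dirent {
    uint32_t inum;      // inode number this name maps to; 0 = unused
    char     name[28];  // file name within this directory
};
```

Looking up /foo/bar then means: read the root inode (number 2), scan its data blocks for an entry named foo, and repeat the process with foo's inode number.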

Locality and the Fast File System

Data structures of the old file system (really simple!):

However, the old file system has pretty bad performance, which brings us to the crux of our problem:

How can we organize on-disk data to improve performance?

FFS: Disk Awareness

FFS divides the disk into a number of cylinder groups.

However, modern drives do not export their geometry information to the file system. Disks only export a logical address space of blocks. Therefore, modern file systems instead organize the drive into block groups.

Accessing files within the same group won't result in long seeks across the disk.

FFS includes all the structures you might expect a file system to have within each group.

A single cylinder group

FFS Policies

What assumptions do we make?

File locality: we assume that files within the same directory are accessed together more often than files in different directories.

How to place directories?

Find the cylinder group with

  1. a low number of allocated directories (to balance directories across groups)
  2. a high number of free inodes (to subsequently be able to allocate a bunch of files)
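
A sketch of this group-selection heuristic (the scoring formula is an illustrative stand-in, not FFS's exact rule):

```c
#define NUM_GROUPS 32

// Hypothetical per-group counters an FFS-like allocator might keep.
struct group_info {
    int allocated_dirs;  // directories already placed in this group
    int free_inodes;     // inodes still available in this group
};

// Directory placement: prefer a group with few directories and many
// free inodes. The simple difference score below is one way to weigh
// the two criteria against each other.
int pick_group_for_dir(const struct group_info g[NUM_GROUPS]) {
    int best = 0;
    for (int i = 1; i < NUM_GROUPS; i++) {
        int score_i    = g[i].free_inodes - g[i].allocated_dirs;
        int score_best = g[best].free_inodes - g[best].allocated_dirs;
        if (score_i > score_best)
            best = i;
    }
    return best;
}
```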

How to place files?

  1. Allocate the data blocks of a file in the same group as its inode.
  2. Place all files that are in the same directory in the cylinder group of the directory they are in.

What is the large-file exception? Why should we place large files differently?

A large file would entirely fill the block group, which prevents subsequent “related” files from being placed within this block group, hurting file-access locality.

So our policy is: after some number of blocks are allocated in the first block group (the number of direct pointers available within an inode), FFS places the next “large” chunk of the file (the blocks pointed to by the first indirect block) in another block group, and so on (see the sketch after the figure).

File spread across groups
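
A sketch of that chunking policy (the round-robin group choice and the 12-direct-pointer / 1024-blocks-per-chunk constants are illustrative assumptions, not FFS's exact parameters):

```c
#include <stdint.h>

#define NUM_DIRECT   12    // direct pointers in the inode
#define CHUNK_BLOCKS 1024  // blocks covered by one indirect block (4 KB / 4 B)

// Large-file placement: blocks reachable from the inode's direct pointers
// stay in the inode's group; each subsequent indirect chunk goes to a
// different group, spreading the file across the disk.
int group_for_block(int inode_group, uint32_t block_index, int num_groups) {
    if (block_index < NUM_DIRECT)
        return inode_group;  // small files stay near their inode
    uint32_t chunk = 1 + (block_index - NUM_DIRECT) / CHUNK_BLOCKS;
    return (inode_group + chunk) % num_groups;
}
```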

Crash Consistency: FSCK and Journaling

The crux of our problem:

The system may crash or lose power at any time during disk operations, and thus the on-disk state may only partially get updated. How do we ensure the file system keeps the on-disk image in a reasonable state?

Solution #1: The File System Checker

Let inconsistencies happen and then fix them later (when rebooting).

By consistency, we mean that the file system metadata is internally consistent.

fsck is a UNIX tool for finding such inconsistencies and repairing them:

  • Superblock: Check whether the superblock looks suspicious and use an alternate copy if a suspect superblock is found.

    Recall that each block group has a copy of the superblock.

  • Free blocks: Produce a correct version of the allocation bitmaps by scanning the inodes, indirect blocks, double indirect blocks, …, to resolve any inconsistency between bitmaps and inodes.

    For now, we trust the information within the inodes.

  • Inode state: Check each inode for corruption; clear suspect inodes and update the inode bitmap correspondingly.

    The information within the inodes can be corrupted.

  • Inode links: Verify the link count of each allocated inode.

    Recall that the link count indicates the number of different directories that contain a reference (i.e., a link) to this particular file.

    Aside: What's the difference between hard link and symbolic link?

    Symbolic Link: A symbolic link is a special type of file (like regular files and directories) that stores the path name (a string) of the target file. It has its own inode: the symlink's inode.

    Hard Link: Multiple directory entries point to the same inode, which tracks them with its link count. Deleting one link decreases the link count by 1; the file is only removed when the count reaches 0.

    Symbolic Link:
    dir entry -> symlink's inode -> (stores the path string: "/path/to/target")

    Hard Link:
    dir entry 1 ──┐
    dir entry 2 ──┴─> target's inode -> (actual data)
  • Duplicates: Check for duplicated pointers, i.e., cases where two different inodes refer to the same block.

  • Bad blocks: Check for bad block pointers (e.g., pointers that refer to something outside the valid range of the volume) and remove them from the inode or indirect block.

  • Directory checks: Check on the contents of each directory.

fsck requires scanning the entire disk, so the problem of fsck is obvious:

As disks grew bigger, the performance became prohibitive.

Solution #2: Journaling (or Write-Ahead Logging)

Basic idea: when updating the disk, before overwriting the structures in place, first write down a little note to a structure (somewhere else on the disk, in a well-known location) organized as a "log", describing what we are about to do.

An ext2 file system (without journaling):

An ext3 file system adds a new key structure: journal

Data journaling

What if TxE is successfully written to the log while Db (the user data) is not?

If we issue all five blocks (TxB, the updated inode, the updated bitmap, Db, and TxE) at once, the disk may complete them in any order. The system could then crash after TxE has reached the disk but before Db has. During recovery, the file system will see a seemingly complete transaction and replay it, even though the journal write never actually finished, causing problems.

How to avoid this problem?

The file system issues the transactional write in two steps:

  1. Write all blocks except the TxE block at once to the journal.

  2. When those writes complete, the file system issues the write of the TxE block, thus leaving the journal in this final, safe state:

TxE should be a single 512-byte block to make the write of TxE atomic (the disk guarantees that any 512-byte write will either happen completely or not at all).

Some file systems (e.g., Linux ext3) batch log updates, buffering multiple updates into a global transaction in memory, which reduces write traffic to disk.

What if the log becomes full?

We make the log a circular log by freeing the space occupied by checkpointed transactions.

Mark the oldest and newest non-checkpointed transactions in the log in a journal superblock:

Our final data journaling protocol:

  1. Journal write: Write the contents of the transaction (including TxB, metadata, and data) to the log; wait for these writes to complete.

  2. Journal commit: Write the transaction commit block (containing TxE) to the log; wait for write to complete.

    The transaction is now committed.

  3. Checkpoint: Write the contents of the update (metadata and data) to their final on-disk locations.

  4. Free: Some time later, mark the transaction free in the journal by updating the journal superblock.
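
Putting the ordering together, here is a sketch of one transaction (the block-layer helpers `write_blocks` and `wait_for_completion` are hypothetical stand-ins for issuing writes and waiting on them):

```c
// Hypothetical block-layer helpers; names are illustrative.
void write_blocks(const void *blocks[], int n, long disk_addr); // issue writes
void wait_for_completion(void);                                 // barrier

// Data journaling sketch for one transaction: the TxE commit block is
// written only after everything it covers is durably in the log, so
// recovery never replays a half-written transaction.
void journal_transaction(const void *txb, const void *metadata,
                         const void *data, const void *txe,
                         long log_addr, long final_addr) {
    const void *body[] = { txb, metadata, data };
    write_blocks(body, 3, log_addr);      // 1. journal write (all but TxE)
    wait_for_completion();
    write_blocks(&txe, 1, log_addr + 3);  // 2. journal commit (512-byte TxE)
    wait_for_completion();                //    the transaction is now committed
    const void *payload[] = { metadata, data };
    write_blocks(payload, 2, final_addr); // 3. checkpoint to final locations
    // 4. later: mark the transaction free in the journal superblock
}
```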

However, we are now writing each data block to disk twice, which is a heavy cost to pay considering that system crashes are rare. How can we improve our protocol?

Metadata Journaling

User data is not written to the log. Instead, it is written to its final on-disk location first.

Our protocol:

  1. Data write: Write data to final location; wait for completion (the wait is optional).

  2. Journal metadata write: Write the begin block and metadata to the log; wait for writes to complete.

    Step 3 must wait for steps 1 and 2 to complete! Step 1 and step 2 can run concurrently.

  3. Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete.

    The transaction (including data) is now committed.

  4. Checkpoint metadata: Write the contents of the metadata update to their final locations within the file system.

  5. Free: Later, mark the transaction free in journal superblock.

The rule is: Write the pointed-to object before the object that points to it!

Linux ext3 gives you the option of choosing either data, ordered, or unordered modes (in unordered mode, data can be written at any time).

Log-structured File Systems

Write to disk sequentially

The crux of the problem:

How to write to disk sequentially?

Before writing to the disk, LFS buffers updates in memory; when it has received a sufficient number of updates, it writes them to disk all at once in one long, sequential write (a segment).

An example in which LFS buffers two sets of updates into a small segment:
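
A sketch of this buffering, with an unrealistically small segment for illustration (`write_segment_to_disk` stands in for the actual sequential disk write):

```c
#include <string.h>

#define BLOCK_SIZE 4096
#define SEG_BLOCKS 16  // tiny segment for illustration; real LFS uses MBs

// In-memory segment buffer: LFS accumulates updated data blocks, inodes,
// and imap pieces here, then writes the whole segment sequentially.
static char segment[SEG_BLOCKS][BLOCK_SIZE];
static int  used = 0;

void write_segment_to_disk(const void *seg, int nblocks);  // hypothetical

void lfs_buffer_update(const void *block) {
    memcpy(segment[used++], block, BLOCK_SIZE);
    if (used == SEG_BLOCKS) {  // segment full: flush in one sequential write
        write_segment_to_disk(segment, used);
        used = 0;
    }
}
```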

Find Inodes

How to find inodes in LFS?

Recall that in FFS we split the inode table into chunks and place a group of inodes within each cylinder group, so we only need to know the address and size of each inode chunk to calculate the address of an inode.

Finding inodes in the old UNIX file system or in FFS is simple because all inodes are placed in fixed areas on disk: you can simply calculate an inode's location given its inode number. In LFS, however, inodes are scattered across the disk, and because we never overwrite an inode in place, its latest version keeps moving.

LFS introduces the inode map (imap). The imap is a structure that takes an inode number as input and produces the disk address of the most recent version of the inode.

Note that the inode map in LFS is a virtualization of inode numbers.

LFS places chunks of the inode map right next to where it is writing all of the other new information to avoid frequent disk seeks:

Now the question is:

How do we find the inode map, now that pieces of it are also spread across the disk?

LFS has a fixed place on disk, known as the checkpoint region (CR), containing pointers to the latest pieces of the inode map.

The entire imap is usually cached in memory, so the extra work LFS does during a read is to look up the inode’s address in the imap.
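
A minimal sketch of the resulting read path, assuming the imap is already cached in memory (the structure and helper names are illustrative):

```c
#include <stdint.h>

// Hypothetical structures; field names are illustrative.
struct inode { uint32_t data_block_addr[12]; };
extern uint32_t imap[];                        // cached in memory via the CR
struct inode *read_inode(uint32_t disk_addr);  // read an inode from the log
void *read_block(uint32_t disk_addr);

// With the imap cached, a read costs the same number of disk I/Os as in
// a conventional file system: the imap lookup is just a memory access.
void *lfs_read(uint32_t inum, uint32_t offset_block) {
    uint32_t inode_addr = imap[inum];            // 1. imap: inum -> inode address
    struct inode *ino = read_inode(inode_addr);  // 2. read the latest inode
    return read_block(ino->data_block_addr[offset_block]);  // 3. read the data
}
```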

The imap also solves the recursive update problem: while an inode keeps moving across the disk, its inode number remains unchanged. Only the imap entry is updated; the directory holds the same name-to-inode-number mapping.

Garbage Collection

How to collect garbage in LFS?

Note that LFS always writes the latest version of a file to new locations on disk, leaving old versions of file structures (garbage) scattered throughout the disk.

LFS periodically finds these old, dead versions of file data, inodes, and other structures, and cleans them.

The LFS cleaner works on a segment-by-segment basis so that subsequent writing can find large contiguous chunks of space. Specifically, we expect the cleaner to read in M existing segments, compact their contents into N new segments (where N < M), and then write the N segments to disk in new locations.

So our problem now is:

  1. How can LFS tell which blocks within a segment are live, and which are dead?
  2. How often should the cleaner run, and which segments should it pick to clean?

LFS stores a little extra information describing each block (for each data block D, its inode number N and its offset T within the file) in a structure at the head of the segment, known as the segment summary block.

Determine whether a block is live or dead:

```c
// For the data block D located at disk address A:
(N, T) = SegmentSummary[A];  // N: inode number, T: offset within the file
inode = Read(imap[N]);       // look up the latest version of inode N
if (inode[T] == A)
    // block D is alive
else
    // block D is garbage
```
