Question You are going to write an extended version of chapter 40 (File System Implementation). You are going
to add content to 40.2-40.4 that corresponds to an example file system. Note that you may need to
refer back to chapter 39 for the representation of the directories. After each section 40.2-40.4 you are
going to add an example that shows the following file structures. In 40.1 you will introduce the example,
and discuss the files from the perspective of the user. In 40.2-40.4 you will discuss the aspects of that
3 files (and the directories to support them).
-rwxr-xr-x
-rw-r--r--
-rw-r--
/users/<your login>/stuff.sh
/users/<your login>/stuff.txt
/users/<your login>/stuff/stuff.txt
With the following content:
stuff.sh
#!/bin/bash
ls -la
stuff.txt (1)
There is stuff here
stuff.txt (2)
There is also stuff here
Please hand in: A .pdf or .docx file/n Interlude: Files and Directories
Thus far we have seen the development of two key operating system ab-
stractions: the process, which is a virtualization of the CPU, and the ad-
dress space, which is a virtualization of memory. In tandem, these two
abstractions allow a program to run as if it is in its own private, isolated
world; as if it has its own processor (or processors); as if it has its own
memory. This illusion makes programming the system much easier and
thus is prevalent today not only on desktops and servers but increasingly
on all programmable platforms including mobile phones and the like.
In this section, we add one more critical piece to the virtualization puz-
zle: persistent storage. A persistent-storage device, such as a classic hard
disk drive or a more modern solid-state storage device, stores informa-
tion permanently (or at least, for a long time). Unlike memory, whose
contents are lost when there is a power loss, a persistent-storage device
keeps such data intact. Thus, the OS must take extra care with such a
device: this is where users keep data that they really care about.
CRUX: HOW TO MANAGE A PERSISTENT DEVICE
How should the OS manage a persistent device? What are the APIs?
What are the important aspects of the implementation?
Thus, in the next few chapters, we will explore critical techniques for
managing persistent data, focusing on methods to improve performance
and reliability. We begin, however, with an overview of the API: the in-
terfaces you'll expect to see when interacting with a UNIX file system.
39.1 Files And Directories
Two key abstractions have developed over time in the virtualization
of storage. The first is the file. A file is simply a linear array of bytes,
each of which you can read or write. Each file has some kind of low-level
name, usually a number of some kind; often, the user is not aware of this
1
39 2
INTERLUDE: FILES AND DIRECTORIES
foo
bar
bar.txt
bar
foo
bar.txt
Figure 39.1: An Example Directory Tree
name (as we will see). For historical reasons, the low-level name of a file
is often referred to as its inode number (i-number). We'll be learning a
lot more about inodes in future chapters; for now, just assume that each
file has an inode number associated with it.
In most systems, the OS does not know much about the structure of
the file (e.g., whether it is a picture, or a text file, or C code); rather, the
responsibility of the file system is simply to store such data persistently
on disk and make sure that when you request the data again, you get
what you put there in the first place. Doing so is not as simple as it seems!
The second abstraction is that of a directory. A directory, like a file,
also has a low-level name (i.e., an inode number), but its contents are
quite specific: it contains a list of (user-readable name, low-level name)
pairs. For example, let's say there is a file with the low-level name "10",
and it is referred to by the user-readable name of "foo". The directory
that "foo" resides in thus would have an entry ("foo", "10″) that maps
the user-readable name to the low-level name. Each entry in a directory
refers to either files or other directories. By placing directories within
other directories, users are able to build an arbitrary directory tree (or
directory hierarchy), under which all files and directories are stored.
The directory hierarchy starts at a root directory (in UNIX-based sys-
tems, the root directory is simply referred to as /) and uses some kind
of separator to name subsequent sub-directories until the desired file or
directory is named. For example, if a user created a directory foo in the
root directory /, and then created a file bar.txt in the directory foo,
we could refer to the file by its absolute pathname, which in this case
would be /foo/bar.txt. See Figure 39.1 for a more complex directory
tree; valid directories in the example are /, /foo, /bar, /bar/bar,
/bar/foo and valid files are/foo/bar.txt and /bar/foo/bar.txt.
OPERATING
SYSTEMS
[VERSION 1.10]
WWW.OSTEP.ORG INTERLUDE: FILES AND DIRECTORIES
3
TIP: THINK Carefully ABOUT NAMING
Naming is an important aspect of computer systems [SK09]. In UNIX
systems, virtually everything that you can think of is named through the
file system. Beyond just files, devices, pipes, and even processes [K84]
can be found in what looks like a plain old file system. This uniformity
of naming eases your conceptual model of the system, and makes the
system simpler and more modular. Thus, whenever creating a system or
interface, think carefully about what names you are using.
Directories and files can have the same name as long as they are in dif-
ferent locations in the file-system tree (e.g., there are two files named
bar.txt in the figure, /foo/bar.txt and /bar/foo/bar.txt).
You may also notice that the file name in this example often has two
parts: bar and txt, separated by a period. The first part is an arbitrary
name, whereas the second part of the file name is usually used to indi-
cate the type of the file, e.g., whether it is C code (e.g., . c), or an image
(e.g., . jpg), or a music file (e.g., .mp3). However, this is usually just a
convention: there is usually no enforcement that the data contained in a
file named main.c is indeed C source code.
Thus, we can see one great thing provided by the file system: a conve-
nient way to name all the files we are interested in. Names are important
in systems as the first step to accessing any resource is being able to name
it. In UNIX systems, the file system thus provides a unified way to access
files on disk, USB stick, CD-ROM, many other devices, and in fact many
other things, all located under the single directory tree.
39.2 The File System Interface
Let's now discuss the file system interface in more detail. We'll start
with the basics of creating, accessing, and deleting files. You may think
this is straightforward, but along the way we'll discover the mysterious
call that is used to remove files, known as unlink (). Hopefully, by the
end of this chapter, this mystery won't be so mysterious to you!
39.3 Creating Files
We'll start with the most basic of operations: creating a file. This can be
accomplished with the open system call; by calling open () and passing
it the O_CREAT flag, a program can create a new file. Here is some exam-
ple code to create a file called "foo" in the current working directory:
int fd =
open ("foo", O_CREAT | O_WRONLY | O__TRUNC,
S_IRUSR | S_IWUSR);
THREE
© 2008-23, ARPACI-DUSSEAU
EASY
PIECES 4
INTERLUDE: FILES AND DIRECTORIES
ASIDE: THE CREAT ( ) SYSTEM CALL
The older way of creating a file is to call creat (), as follows:
// option: add second flag to set permissions
int fd creat ("foo");
=
You can think of creat () as open () with the following flags: O_CREAT
| O_WRONLY | O_TRUNC. Because open() can create a file, the usage
of creat () has somewhat fallen out of favor (indeed, it could just be
implemented as a library call to open () ); however, it does hold a special
place in UNIX lore. Specifically, when Ken Thompson was asked what he
would do differently if he were redesigning UNIX, he replied: “I'd spell
creat with an e.”
The routine open () takes a number of different flags. In this exam-
ple, the second parameter creates the file (O_CREAT) if it does not exist,
ensures that the file can only be written to (O_WRONLY), and, if the file
already exists, truncates it to a size of zero bytes thus removing any exist-
ing content (O_TRUNC). The third parameter specifies permissions, in this
case making the file readable and writable by the owner.
One important aspect of open () is what it returns: a file descriptor. A
file descriptor is just an integer, private per process, and is used in UNIX
systems to access files; thus, once a file is opened, you use the file de-
scriptor to read or write the file, assuming you have permission to do so.
In this way, a file descriptor is a capability [L84], i.e., an opaque handle
that gives you the power to perform certain operations. Another way to
think of a file descriptor is as a pointer to an object of type file; once you
have such an object, you can call other "methods" to access the file, like
read() and write () (we'll see how to do so below).
As stated above, file descriptors are managed by the operating system
on a per-process basis. This means some kind of simple structure (e.g., an
array) is kept in the proc structure on UNIX systems. Here is the relevant
piece from the xv6 kernel [CK+08]:
struct proc {
struct file *ofile [NOFILE]; // Open files
;
A simple array (with a maximum of NOFILE open files), indexed by
the file descriptor, tracks which files are opened on a per-process basis.
Each entry of the array is actually just a pointer to a struct file, which
will be used to track information about the file being read or written; we'll
discuss this further below.
OPERATING
SYSTEMS
[VERSION 1.10]
WWW.OSTEP.ORG INTERLUDE: FILES AND DIRECTORIES
5
TIP: USE STRACE (AND SIMILAR TOOLS)
The strace tool provides an awesome way to see what programs are up
to. By running it, you can trace which system calls a program makes, see
the arguments and return codes, and generally get a very good idea of
what is going on.
The tool also takes some arguments which can be quite useful. For ex-
ample, −f follows any fork'd children too; -t reports the time of day
at each call; -e trace=open,close, read, write only traces calls to
those system calls and ignores all others. There are many other flags; read
the man pages and find out how to harness this wonderful tool.
39.4 Reading And Writing Files
Once we have some files, of course we might like to read or write them.
Let's start by reading an existing file. If we were typing at a command
line, we might just use the program cat to dump the contents of the file
to the screen.
prompt> echo hello > foo
prompt cat foo
hello
prompt>
In this code snippet, we redirect the output of the program echo to
the file foo, which then contains the word "hello" in it. We then use cat
to see the contents of the file. But how does the cat program access the
file foo?
To find this out, we'll use an incredibly useful tool to trace the sys-
tem calls made by a program. On Linux, the tool is called strace; other
systems have similar tools (see dtruss on a Mac, or truss on some older
UNIX variants). What strace does is trace every system call made by a
program while it runs, and dump the trace to the screen for you to see.
Here is an example of using strace to figure out what cat is doing
(some calls removed for readability):
prompt strace cat foo
...
open("foo", O_RDONLY|O_LARGEFILE)
read (3,
write (1,
hello
read (3,
close (3)
prompt>
"hello\n",
4096)
"hello\n", 6)
", 4096)
|| || ||
366
=
= 6
00
=
= 0
THREE
© 2008-23, ARPACI-DUSSEAU
EASY
PIECES/n File System Implementation
In this chapter, we introduce a simple file system implementation, known
as vsfs (the Very Simple File System). This file system is a simplified
version of a typical UNIX file system and thus serves to introduce some
of the basic on-disk structures, access methods, and various policies that
you will find in many file systems today.
The file system is pure software; unlike our development of CPU and
memory virtualization, we will not be adding hardware features to make
some aspect of the file system work better (though we will want to pay at-
tention to device characteristics to make sure the file system works well).
Because of the great flexibility we have in building a file system, many
different ones have been built, literally from AFS (the Andrew File Sys-
tem) [H+88] to ZFS (Sun's Zettabyte File System) [B07]. All of these file
systems have different data structures and do some things better or worse
than their peers. Thus, the way we will be learning about file systems is
through case studies: first, a simple file system (vsfs) in this chapter to
introduce most concepts, and then a series of studies of real file systems
to understand how they can differ in practice.
THE CRUX: HOW TO IMPLEMENT A SIMPLE FILE SYSTEM
How can we build a simple file system? What structures are needed
on the disk? What do they need to track? How are they accessed?
40.1 The Way To Think
To think about file systems, we usually suggest thinking about two
different aspects of them; if you understand both of these aspects, you
probably understand how the file system basically works.
The first is the data structures of the file system. In other words, what
types of on-disk structures are utilized by the file system to organize its
data and metadata? The first file systems we'll see (including vsfs below)
employ simple structures, like arrays of blocks or other objects, whereas
1
40 2
FILE SYSTEM IMPLEMENTATION
ASIDE: MENTAL MODELS OF FILE SYSTEMS
As we've discussed before, mental models are what you are really trying
to develop when learning about systems. For file systems, your mental
model should eventually include answers to questions like: what on-disk
structures store the file system's data and metadata? What happens when
a process opens a file? Which on-disk structures are accessed during a
read or write? By working on and improving your mental model, you
develop an abstract understanding of what is going on, instead of just
trying to understand the specifics of some file-system code (though that
is also useful, of course!).
more sophisticated file systems, like SGI's XFS, use more complicated
tree-based structures [S+96].
it
The second aspect of a file system is its access methods. How does
map the calls made by a process, such as open (), read(), write(),
etc., onto its structures? Which structures are read during the execution
of a particular system call? Which are written? How efficiently are all of
these steps performed?
If you understand the data structures and access methods of a file sys-
tem, you have developed a good mental model of how it truly works, a
key part of the systems mindset. Try to work on developing your mental
model as we delve into our first implementation.
40.2 Overall Organization
We now develop the overall on-disk organization of the data struc-
tures of the vsfs file system. The first thing we'll need to do is divide the
disk into blocks; simple file systems use just one block size, and that's
exactly what we'll do here. Let's choose a commonly-used size of 4 KB.
Thus, our view of the disk partition where we're building our file sys-
tem is simple: a series of blocks, each of size 4 KB. The blocks are ad-
dressed from 0 to N − 1, in a partition of size N 4-KB blocks. Assume we
have a really small disk, with just 64 blocks:
0
7 8
15 16
23 24
31
32
39 40
47 48
55 56
63
Let's now think about what we need to store in these blocks to build
a file system. Of course, the first thing that comes to mind is user data.
In fact, most of the space in any file system is (and should be) user data.
Let's call the region of the disk we use for user data the data region, and,
OPERATING
SYSTEMS
[VERSION 1.10]
WWW.OSTEP.ORG FILE SYSTEM IMPLEMENTATION
3
again for simplicity, reserve a fixed portion of the disk for these blocks,
say the last 56 of 64 blocks on the disk:
0
DD
7 8
DDDD DDDDDDDD DDDD|D|D|D|D
Data Region
15 16
Data Region
23 24
32
DIDIDIDIDIDIR
39 40
47 48
31
DDDDDDDD DDDDDDDD DDDDDDDD DDDDDDDD
DIDIDIDIDIDI
55 56
63
As we learned about (a little) last chapter, the file system has to track
information about each file. This information is a key piece of metadata,
and tracks things like which data blocks (in the data region) comprise a
file, the size of the file, its owner and access rights, access and modify
times, and other similar kinds of information. To store this information,
file systems usually have a structure called an inode (we'll read more
about inodes below).
To accommodate inodes, we'll need to reserve some space on the disk
for them as well. Let's call this portion of the disk the inode table, which
simply holds an array of on-disk inodes. Thus, our on-disk image now
looks like this picture, assuming that we use 5 of our 64 blocks for inodes
(denoted by I's in the diagram):
0
DDD
32
Inodes
Data Region
23 24
DDDDDDDD DDDDDDDD DDDDDDDD
DIDIDIDIDIDIDIP 8
7 8
16
Data Region
31
DDDDD DDDDDDDD DDDDDDDD DDDDDDDD
39 40
47 48
55 56
63
We should note here that inodes are typically not that big, for example
128 or 256 bytes. Assuming 256 bytes per inode, a 4-KB block can hold 16
inodes, and our file system above contains 80 total inodes. In our simple
file system, built on a tiny 64-block partition, this number represents the
maximum number of files we can have in our file system; however, do
note that the same file system, built on a larger disk, could simply allocate
a larger inode table and thus accommodate more files.
Our file system thus far has data blocks (D), and inodes (I), but a few
things are still missing. One primary component that is still needed, as
you might have guessed, is some way to track whether inodes or data
blocks are free or allocated. Such allocation structures are thus a requisite
element in any file system.
Many allocation-tracking methods are possible, of course. For exam-
ple, we could use a free list that points to the first free block, which then
points to the next free block, and so forth. We instead choose a simple and
popular structure known as a bitmap, one for the data region (the data
bitmap), and one for the inode table (the inode bitmap). A bitmap is a
© 2008-23, ARPACI-DUSSEAU
THREE
EASY
PIECES 4
FILE SYSTEM IMPLEMENTATION
simple structure: each bit is used to indicate whether the corresponding
object/block is free (0) or in-use (1). And thus our new on-disk layout,
with an inode bitmap (i) and a data bitmap (d):
Inodes
0
7 8
[DIDIDIDIDIDIDID DIDIDIDIDDDDDDDDD|DIDID
Data Region
15 16
Data Region
23 24
47 48
31
DDDDDDDD DDDDDDDD DDDDDDDD DDDDDDDD
32
39 40
55 56
63
You may notice that it is a bit of overkill to use an entire 4-KB block for
these bitmaps; such a bitmap can track whether 32K objects are allocated,
and yet we only have 80 inodes and 56 data blocks. However, we just use
an entire 4-KB block for each of these bitmaps for simplicity.
The careful reader (i.e., the reader who is still awake) may have no-
ticed there is one block left in the design of the on-disk structure of our
very simple file system. We reserve this for the superblock, denoted by
an S in the diagram below. The superblock contains information about
this particular file system, including, for example, how many inodes and
data blocks are in the file system (80 and 56, respectively in this instance),
where the inode table begins (block 3), and so forth. It will likely also
include a magic number of some kind to identify the file system type (in
this case, vsfs).
DDI
32
Inodes
7 8
DDDDDDDD DDDDDDDD DDDDDDDD
Data Region
15 16
Data Region
DIDIDIDID
23 24
47 48
55 56
ODDD DDDDDDDD DDDDDDDD DDD
39 40
31
ODD
63
Thus, when mounting a file system, the operating system will read
the superblock first, to initialize various parameters, and then attach the
volume to the file-system tree. When files within the volume are accessed,
the system will thus know exactly where to look for the needed on-disk
structures.
40.3 File Organization: The Inode
One of the most important on-disk structures of a file system is the
inode; virtually all file systems have a structure similar to this. The name
inode is short for index node, the historical name given to it in UNIX
[RT74] and possibly earlier systems, used because these nodes were orig-
inally arranged in an array, and the array indexed into when accessing a
particular inode.
OPERATING
SYSTEMS
WWW.OSTEP.ORG
[VERSION 1.10] FILE SYSTEM IMPLEMENTATION
5
ASIDE: DATA STRUCTURE — THE INODE
-
The inode is the generic name that is used in many file systems to de-
scribe the structure that holds the metadata for a given file, such as its
length, permissions, and the location of its constituent blocks. The name
goes back at least as far as UNIX (and probably further back to Multics
if not earlier systems); it is short for index node, as the inode number is
used to index into an array of on-disk inodes in order to find the inode
of that number. As we'll see, design of the inode is one key part of file
system design. Most modern systems have some kind of structure like
this for every file they track, but perhaps call them different things (such
as dnodes, fnodes, etc.).
Each inode is implicitly referred to by a number (called the i-number),
which we've earlier called the low-level name of the file. In vsfs (and
other simple file systems), given an i-number, you should directly be able
to calculate where on the disk the corresponding inode is located. For ex-
ample, take the inode table of vsfs as above: 20KB in size (five 4KB blocks)
and thus consisting of 80 inodes (assuming each inode is 256 bytes); fur-
ther assume that the inode region starts at 12KB (i.e, the superblock starts
at 0KB, the inode bitmap is at address 4KB, the data bitmap at 8KB, and
thus the inode table comes right after). In vsfs, we thus have the following
layout for the beginning of the file system partition (in closeup view):
The Inode Table (Closeup)
iblock 0iblock 1iblock 2 iblock 3 iblock 4
012316171819323334354849505164656667
456720212223363738395253545568697071
Super i-bmap d-bmap 9 10 11 24 25 26 27 40 41 42 4356 57 58 59 72737475
OKB
4KB
8KB
12KB
1213141528293031 44 45 46 47 6061626376777879
16KB 20KB 24KB 28KB 32KB
To read inode number 32, the file system would first calculate the off-
set into the inode region (32 · sizeof(inode) or 8192), add it to the start
address of the inode table on disk (inodeStartAddr = 12KB), and thus
arrive upon the correct byte address of the desired block of inodes: 20KB.
Recall that disks are not byte addressable, but rather consist of a large
number of addressable sectors, usually 512 bytes. Thus, to fetch the block
of inodes that contains inode 32, the file system would issue a read to sec-
tor 20×1024
or 40, to fetch the desired inode block. More generally, the
sector address sector of the inode block can be calculated as follows:
blk
sector =
512 '
=
(inumber * sizeof (inode_t)) / blockSize;
((blk blockSize) + inodeStartAddr) / sectorSize;
Inside each inode is virtually all of the information you need about a
file: its type (e.g., regular file, directory, etc.), its size, the number of blocks
allocated to it, protection information (such as who owns the file, as well
© 2008-23, ARPACI-DUSSEAU
THREE
EASY
PIECES