Search for question
Question

You are going to write an extended version of chapter 40 (File System Implementation). You are going to add content to 40.2-40.4 that corresponds to an example file system. Note that you may need to refer back to chapter 39 for the representation of the directories. After each section 40.2-40.4 you are going to add an example that shows the following file structures. In 40.1 you will introduce the example, and discuss the files from the perspective of the user. In 40.2-40.4 you will discuss the aspects of that 3 files (and the directories to support them). -rwxr-xr-x -rw-r--r-- -rw-r-- /users/<your login>/stuff.sh /users/<your login>/stuff.txt /users/<your login>/stuff/stuff.txt With the following content: stuff.sh #!/bin/bash ls -la stuff.txt (1) There is stuff here stuff.txt (2) There is also stuff here Please hand in: A .pdf or .docx file/n Interlude: Files and Directories Thus far we have seen the development of two key operating system ab- stractions: the process, which is a virtualization of the CPU, and the ad- dress space, which is a virtualization of memory. In tandem, these two abstractions allow a program to run as if it is in its own private, isolated world; as if it has its own processor (or processors); as if it has its own memory. This illusion makes programming the system much easier and thus is prevalent today not only on desktops and servers but increasingly on all programmable platforms including mobile phones and the like. In this section, we add one more critical piece to the virtualization puz- zle: persistent storage. A persistent-storage device, such as a classic hard disk drive or a more modern solid-state storage device, stores informa- tion permanently (or at least, for a long time). Unlike memory, whose contents are lost when there is a power loss, a persistent-storage device keeps such data intact. Thus, the OS must take extra care with such a device: this is where users keep data that they really care about. CRUX: HOW TO MANAGE A PERSISTENT DEVICE How should the OS manage a persistent device? What are the APIs? What are the important aspects of the implementation? Thus, in the next few chapters, we will explore critical techniques for managing persistent data, focusing on methods to improve performance and reliability. We begin, however, with an overview of the API: the in- terfaces you'll expect to see when interacting with a UNIX file system. 39.1 Files And Directories Two key abstractions have developed over time in the virtualization of storage. The first is the file. A file is simply a linear array of bytes, each of which you can read or write. Each file has some kind of low-level name, usually a number of some kind; often, the user is not aware of this 1 39 2 INTERLUDE: FILES AND DIRECTORIES foo bar bar.txt bar foo bar.txt Figure 39.1: An Example Directory Tree name (as we will see). For historical reasons, the low-level name of a file is often referred to as its inode number (i-number). We'll be learning a lot more about inodes in future chapters; for now, just assume that each file has an inode number associated with it. In most systems, the OS does not know much about the structure of the file (e.g., whether it is a picture, or a text file, or C code); rather, the responsibility of the file system is simply to store such data persistently on disk and make sure that when you request the data again, you get what you put there in the first place. Doing so is not as simple as it seems! The second abstraction is that of a directory. A directory, like a file, also has a low-level name (i.e., an inode number), but its contents are quite specific: it contains a list of (user-readable name, low-level name) pairs. For example, let's say there is a file with the low-level name "10", and it is referred to by the user-readable name of "foo". The directory that "foo" resides in thus would have an entry ("foo", "10″) that maps the user-readable name to the low-level name. Each entry in a directory refers to either files or other directories. By placing directories within other directories, users are able to build an arbitrary directory tree (or directory hierarchy), under which all files and directories are stored. The directory hierarchy starts at a root directory (in UNIX-based sys- tems, the root directory is simply referred to as /) and uses some kind of separator to name subsequent sub-directories until the desired file or directory is named. For example, if a user created a directory foo in the root directory /, and then created a file bar.txt in the directory foo, we could refer to the file by its absolute pathname, which in this case would be /foo/bar.txt. See Figure 39.1 for a more complex directory tree; valid directories in the example are /, /foo, /bar, /bar/bar, /bar/foo and valid files are/foo/bar.txt and /bar/foo/bar.txt. OPERATING SYSTEMS [VERSION 1.10] WWW.OSTEP.ORG INTERLUDE: FILES AND DIRECTORIES 3 TIP: THINK Carefully ABOUT NAMING Naming is an important aspect of computer systems [SK09]. In UNIX systems, virtually everything that you can think of is named through the file system. Beyond just files, devices, pipes, and even processes [K84] can be found in what looks like a plain old file system. This uniformity of naming eases your conceptual model of the system, and makes the system simpler and more modular. Thus, whenever creating a system or interface, think carefully about what names you are using. Directories and files can have the same name as long as they are in dif- ferent locations in the file-system tree (e.g., there are two files named bar.txt in the figure, /foo/bar.txt and /bar/foo/bar.txt). You may also notice that the file name in this example often has two parts: bar and txt, separated by a period. The first part is an arbitrary name, whereas the second part of the file name is usually used to indi- cate the type of the file, e.g., whether it is C code (e.g., . c), or an image (e.g., . jpg), or a music file (e.g., .mp3). However, this is usually just a convention: there is usually no enforcement that the data contained in a file named main.c is indeed C source code. Thus, we can see one great thing provided by the file system: a conve- nient way to name all the files we are interested in. Names are important in systems as the first step to accessing any resource is being able to name it. In UNIX systems, the file system thus provides a unified way to access files on disk, USB stick, CD-ROM, many other devices, and in fact many other things, all located under the single directory tree. 39.2 The File System Interface Let's now discuss the file system interface in more detail. We'll start with the basics of creating, accessing, and deleting files. You may think this is straightforward, but along the way we'll discover the mysterious call that is used to remove files, known as unlink (). Hopefully, by the end of this chapter, this mystery won't be so mysterious to you! 39.3 Creating Files We'll start with the most basic of operations: creating a file. This can be accomplished with the open system call; by calling open () and passing it the O_CREAT flag, a program can create a new file. Here is some exam- ple code to create a file called "foo" in the current working directory: int fd = open ("foo", O_CREAT | O_WRONLY | O__TRUNC, S_IRUSR | S_IWUSR); THREE © 2008-23, ARPACI-DUSSEAU EASY PIECES 4 INTERLUDE: FILES AND DIRECTORIES ASIDE: THE CREAT ( ) SYSTEM CALL The older way of creating a file is to call creat (), as follows: // option: add second flag to set permissions int fd creat ("foo"); = You can think of creat () as open () with the following flags: O_CREAT | O_WRONLY | O_TRUNC. Because open() can create a file, the usage of creat () has somewhat fallen out of favor (indeed, it could just be implemented as a library call to open () ); however, it does hold a special place in UNIX lore. Specifically, when Ken Thompson was asked what he would do differently if he were redesigning UNIX, he replied: “I'd spell creat with an e.” The routine open () takes a number of different flags. In this exam- ple, the second parameter creates the file (O_CREAT) if it does not exist, ensures that the file can only be written to (O_WRONLY), and, if the file already exists, truncates it to a size of zero bytes thus removing any exist- ing content (O_TRUNC). The third parameter specifies permissions, in this case making the file readable and writable by the owner. One important aspect of open () is what it returns: a file descriptor. A file descriptor is just an integer, private per process, and is used in UNIX systems to access files; thus, once a file is opened, you use the file de- scriptor to read or write the file, assuming you have permission to do so. In this way, a file descriptor is a capability [L84], i.e., an opaque handle that gives you the power to perform certain operations. Another way to think of a file descriptor is as a pointer to an object of type file; once you have such an object, you can call other "methods" to access the file, like read() and write () (we'll see how to do so below). As stated above, file descriptors are managed by the operating system on a per-process basis. This means some kind of simple structure (e.g., an array) is kept in the proc structure on UNIX systems. Here is the relevant piece from the xv6 kernel [CK+08]: struct proc { struct file *ofile [NOFILE]; // Open files ; A simple array (with a maximum of NOFILE open files), indexed by the file descriptor, tracks which files are opened on a per-process basis. Each entry of the array is actually just a pointer to a struct file, which will be used to track information about the file being read or written; we'll discuss this further below. OPERATING SYSTEMS [VERSION 1.10] WWW.OSTEP.ORG INTERLUDE: FILES AND DIRECTORIES 5 TIP: USE STRACE (AND SIMILAR TOOLS) The strace tool provides an awesome way to see what programs are up to. By running it, you can trace which system calls a program makes, see the arguments and return codes, and generally get a very good idea of what is going on. The tool also takes some arguments which can be quite useful. For ex- ample, −f follows any fork'd children too; -t reports the time of day at each call; -e trace=open,close, read, write only traces calls to those system calls and ignores all others. There are many other flags; read the man pages and find out how to harness this wonderful tool. 39.4 Reading And Writing Files Once we have some files, of course we might like to read or write them. Let's start by reading an existing file. If we were typing at a command line, we might just use the program cat to dump the contents of the file to the screen. prompt> echo hello > foo prompt cat foo hello prompt> In this code snippet, we redirect the output of the program echo to the file foo, which then contains the word "hello" in it. We then use cat to see the contents of the file. But how does the cat program access the file foo? To find this out, we'll use an incredibly useful tool to trace the sys- tem calls made by a program. On Linux, the tool is called strace; other systems have similar tools (see dtruss on a Mac, or truss on some older UNIX variants). What strace does is trace every system call made by a program while it runs, and dump the trace to the screen for you to see. Here is an example of using strace to figure out what cat is doing (some calls removed for readability): prompt strace cat foo ... open("foo", O_RDONLY|O_LARGEFILE) read (3, write (1, hello read (3, close (3) prompt> "hello\n", 4096) "hello\n", 6) ", 4096) || || || 366 = = 6 00 = = 0 THREE © 2008-23, ARPACI-DUSSEAU EASY PIECES/n File System Implementation In this chapter, we introduce a simple file system implementation, known as vsfs (the Very Simple File System). This file system is a simplified version of a typical UNIX file system and thus serves to introduce some of the basic on-disk structures, access methods, and various policies that you will find in many file systems today. The file system is pure software; unlike our development of CPU and memory virtualization, we will not be adding hardware features to make some aspect of the file system work better (though we will want to pay at- tention to device characteristics to make sure the file system works well). Because of the great flexibility we have in building a file system, many different ones have been built, literally from AFS (the Andrew File Sys- tem) [H+88] to ZFS (Sun's Zettabyte File System) [B07]. All of these file systems have different data structures and do some things better or worse than their peers. Thus, the way we will be learning about file systems is through case studies: first, a simple file system (vsfs) in this chapter to introduce most concepts, and then a series of studies of real file systems to understand how they can differ in practice. THE CRUX: HOW TO IMPLEMENT A SIMPLE FILE SYSTEM How can we build a simple file system? What structures are needed on the disk? What do they need to track? How are they accessed? 40.1 The Way To Think To think about file systems, we usually suggest thinking about two different aspects of them; if you understand both of these aspects, you probably understand how the file system basically works. The first is the data structures of the file system. In other words, what types of on-disk structures are utilized by the file system to organize its data and metadata? The first file systems we'll see (including vsfs below) employ simple structures, like arrays of blocks or other objects, whereas 1 40 2 FILE SYSTEM IMPLEMENTATION ASIDE: MENTAL MODELS OF FILE SYSTEMS As we've discussed before, mental models are what you are really trying to develop when learning about systems. For file systems, your mental model should eventually include answers to questions like: what on-disk structures store the file system's data and metadata? What happens when a process opens a file? Which on-disk structures are accessed during a read or write? By working on and improving your mental model, you develop an abstract understanding of what is going on, instead of just trying to understand the specifics of some file-system code (though that is also useful, of course!). more sophisticated file systems, like SGI's XFS, use more complicated tree-based structures [S+96]. it The second aspect of a file system is its access methods. How does map the calls made by a process, such as open (), read(), write(), etc., onto its structures? Which structures are read during the execution of a particular system call? Which are written? How efficiently are all of these steps performed? If you understand the data structures and access methods of a file sys- tem, you have developed a good mental model of how it truly works, a key part of the systems mindset. Try to work on developing your mental model as we delve into our first implementation. 40.2 Overall Organization We now develop the overall on-disk organization of the data struc- tures of the vsfs file system. The first thing we'll need to do is divide the disk into blocks; simple file systems use just one block size, and that's exactly what we'll do here. Let's choose a commonly-used size of 4 KB. Thus, our view of the disk partition where we're building our file sys- tem is simple: a series of blocks, each of size 4 KB. The blocks are ad- dressed from 0 to N − 1, in a partition of size N 4-KB blocks. Assume we have a really small disk, with just 64 blocks: 0 7 8 15 16 23 24 31 32 39 40 47 48 55 56 63 Let's now think about what we need to store in these blocks to build a file system. Of course, the first thing that comes to mind is user data. In fact, most of the space in any file system is (and should be) user data. Let's call the region of the disk we use for user data the data region, and, OPERATING SYSTEMS [VERSION 1.10] WWW.OSTEP.ORG FILE SYSTEM IMPLEMENTATION 3 again for simplicity, reserve a fixed portion of the disk for these blocks, say the last 56 of 64 blocks on the disk: 0 DD 7 8 DDDD DDDDDDDD DDDD|D|D|D|D Data Region 15 16 Data Region 23 24 32 DIDIDIDIDIDIR 39 40 47 48 31 DDDDDDDD DDDDDDDD DDDDDDDD DDDDDDDD DIDIDIDIDIDI 55 56 63 As we learned about (a little) last chapter, the file system has to track information about each file. This information is a key piece of metadata, and tracks things like which data blocks (in the data region) comprise a file, the size of the file, its owner and access rights, access and modify times, and other similar kinds of information. To store this information, file systems usually have a structure called an inode (we'll read more about inodes below). To accommodate inodes, we'll need to reserve some space on the disk for them as well. Let's call this portion of the disk the inode table, which simply holds an array of on-disk inodes. Thus, our on-disk image now looks like this picture, assuming that we use 5 of our 64 blocks for inodes (denoted by I's in the diagram): 0 DDD 32 Inodes Data Region 23 24 DDDDDDDD DDDDDDDD DDDDDDDD DIDIDIDIDIDIDIP 8 7 8 16 Data Region 31 DDDDD DDDDDDDD DDDDDDDD DDDDDDDD 39 40 47 48 55 56 63 We should note here that inodes are typically not that big, for example 128 or 256 bytes. Assuming 256 bytes per inode, a 4-KB block can hold 16 inodes, and our file system above contains 80 total inodes. In our simple file system, built on a tiny 64-block partition, this number represents the maximum number of files we can have in our file system; however, do note that the same file system, built on a larger disk, could simply allocate a larger inode table and thus accommodate more files. Our file system thus far has data blocks (D), and inodes (I), but a few things are still missing. One primary component that is still needed, as you might have guessed, is some way to track whether inodes or data blocks are free or allocated. Such allocation structures are thus a requisite element in any file system. Many allocation-tracking methods are possible, of course. For exam- ple, we could use a free list that points to the first free block, which then points to the next free block, and so forth. We instead choose a simple and popular structure known as a bitmap, one for the data region (the data bitmap), and one for the inode table (the inode bitmap). A bitmap is a © 2008-23, ARPACI-DUSSEAU THREE EASY PIECES 4 FILE SYSTEM IMPLEMENTATION simple structure: each bit is used to indicate whether the corresponding object/block is free (0) or in-use (1). And thus our new on-disk layout, with an inode bitmap (i) and a data bitmap (d): Inodes 0 7 8 [DIDIDIDIDIDIDID DIDIDIDIDDDDDDDDD|DIDID Data Region 15 16 Data Region 23 24 47 48 31 DDDDDDDD DDDDDDDD DDDDDDDD DDDDDDDD 32 39 40 55 56 63 You may notice that it is a bit of overkill to use an entire 4-KB block for these bitmaps; such a bitmap can track whether 32K objects are allocated, and yet we only have 80 inodes and 56 data blocks. However, we just use an entire 4-KB block for each of these bitmaps for simplicity. The careful reader (i.e., the reader who is still awake) may have no- ticed there is one block left in the design of the on-disk structure of our very simple file system. We reserve this for the superblock, denoted by an S in the diagram below. The superblock contains information about this particular file system, including, for example, how many inodes and data blocks are in the file system (80 and 56, respectively in this instance), where the inode table begins (block 3), and so forth. It will likely also include a magic number of some kind to identify the file system type (in this case, vsfs). DDI 32 Inodes 7 8 DDDDDDDD DDDDDDDD DDDDDDDD Data Region 15 16 Data Region DIDIDIDID 23 24 47 48 55 56 ODDD DDDDDDDD DDDDDDDD DDD 39 40 31 ODD 63 Thus, when mounting a file system, the operating system will read the superblock first, to initialize various parameters, and then attach the volume to the file-system tree. When files within the volume are accessed, the system will thus know exactly where to look for the needed on-disk structures. 40.3 File Organization: The Inode One of the most important on-disk structures of a file system is the inode; virtually all file systems have a structure similar to this. The name inode is short for index node, the historical name given to it in UNIX [RT74] and possibly earlier systems, used because these nodes were orig- inally arranged in an array, and the array indexed into when accessing a particular inode. OPERATING SYSTEMS WWW.OSTEP.ORG [VERSION 1.10] FILE SYSTEM IMPLEMENTATION 5 ASIDE: DATA STRUCTURE — THE INODE - The inode is the generic name that is used in many file systems to de- scribe the structure that holds the metadata for a given file, such as its length, permissions, and the location of its constituent blocks. The name goes back at least as far as UNIX (and probably further back to Multics if not earlier systems); it is short for index node, as the inode number is used to index into an array of on-disk inodes in order to find the inode of that number. As we'll see, design of the inode is one key part of file system design. Most modern systems have some kind of structure like this for every file they track, but perhaps call them different things (such as dnodes, fnodes, etc.). Each inode is implicitly referred to by a number (called the i-number), which we've earlier called the low-level name of the file. In vsfs (and other simple file systems), given an i-number, you should directly be able to calculate where on the disk the corresponding inode is located. For ex- ample, take the inode table of vsfs as above: 20KB in size (five 4KB blocks) and thus consisting of 80 inodes (assuming each inode is 256 bytes); fur- ther assume that the inode region starts at 12KB (i.e, the superblock starts at 0KB, the inode bitmap is at address 4KB, the data bitmap at 8KB, and thus the inode table comes right after). In vsfs, we thus have the following layout for the beginning of the file system partition (in closeup view): The Inode Table (Closeup) iblock 0iblock 1iblock 2 iblock 3 iblock 4 012316171819323334354849505164656667 456720212223363738395253545568697071 Super i-bmap d-bmap 9 10 11 24 25 26 27 40 41 42 4356 57 58 59 72737475 OKB 4KB 8KB 12KB 1213141528293031 44 45 46 47 6061626376777879 16KB 20KB 24KB 28KB 32KB To read inode number 32, the file system would first calculate the off- set into the inode region (32 · sizeof(inode) or 8192), add it to the start address of the inode table on disk (inodeStartAddr = 12KB), and thus arrive upon the correct byte address of the desired block of inodes: 20KB. Recall that disks are not byte addressable, but rather consist of a large number of addressable sectors, usually 512 bytes. Thus, to fetch the block of inodes that contains inode 32, the file system would issue a read to sec- tor 20×1024 or 40, to fetch the desired inode block. More generally, the sector address sector of the inode block can be calculated as follows: blk sector = 512 ' = (inumber * sizeof (inode_t)) / blockSize; ((blk blockSize) + inodeStartAddr) / sectorSize; Inside each inode is virtually all of the information you need about a file: its type (e.g., regular file, directory, etc.), its size, the number of blocks allocated to it, protection information (such as who owns the file, as well © 2008-23, ARPACI-DUSSEAU THREE EASY PIECES

Fig: 1