Friday, September 30, 2005

Lose vs. loose

Some days ago, I misspelled lose as loose in one of my posts. One of the readers pointed me out the error and suggested to look at this comic strip from Queen of Wands (never read it before). I think I'll never make the same mistake again ;-)

Oh, and BTW, yesterday's strip is kind of funny too; you ought to read it.

Monday, September 26, 2005

Cleaning up a Debian box

I have to confess that, for a long time, I was a big Debian fan; this changed when I switched to the BSDs, but I still think that it is (one of) the best GNU/Linux distribution. This, and because its excellent hardware support, is why it's the main system on my iBook G3.

As a user, it's very common to install lots of packages just to try them, followed by a quick deinstallation when you realize they don't suit your needs. Unfortunately, following common procedures often leaves useless garbage in the file system.

The first annoying thing are installed packages that provide libraries but which aren't used by any other package. These are often called orphan packages and can be listed by using the deborphan utility (which must be installed first). Assuming you have it, do the following to remove all orphaned libraries:

# dpkg --purge $(deborphan)

Note that this in turn may obsolete other libraries, so rerun this command until deborphan prints nothing. Also, please bear in mind that this is dangerous if you have installed software from other means — not from Debian packages — as those libraries may still be used.

Then we have obsolete configuration files. Whenever you remove a package, all its files are removed except the configuration ones, in case you did some modification you want to preserve during an eventual reinstallation. This may be OK in some scenarios but, personally, when I delete something I want it really gone. Here comes dpkg's --purge option to help, which removes all leftovers of a previously installed package.

But how to purge all configuration files from packages that were once installed? Just do the following:

# dpkg --purge $(dpkg --list | grep ^rc | awk '{ print $2; }')

This constructs a list of packages which are removed but still have their configuration files installed (denoted by the ^rc pattern) and passes this list to the package management utility to purge all obsolete files.

Friday, September 23, 2005

NFS exports lists rototill

After two weeks of work, the NFS exports lists rototill that I briefly outlined in this past post is finished and committed into NetBSD's source tree. Believe it or not, the whole set of changes was triggered by a XXX mark in mountd(8)'s code (in other words, fixing code marked as such is not always trivial).

In the past, when a file system wanted to support NFS, it had to include two fields in a fixed position of its mount arguments structure due to the broken way in which mountd(8) handled the mount(2) calls. Furthermore, each file system had to deal internally with the NFS exports list, duplicating code among all NFS-aware file systems; this feature was clearly generic, so it had to be placed in an upper generalization level. At last, you had to use the mount(2) system call in a wired way to change the exports.

This was wrong because it broke extensibility and code cleanliness: you couldn't add a new NFS-aware file system without modifying and rebuilding mountd(8). There was also a lot of code duplication and the interface was just ugly from a design point of view. It looked like a quick band-aid over old code to add support for more than one NFS-aware file system.

The changes I just committed have centralized all NFS exports lists handling in a single place and it has been removed from the kernel's core. This stuff is now only compiled in when the NFSSERVER option is enabled, reducing the final kernel size by around 5KB.

The next logical step is to rework mountd(8) to update the whole exports list for a given mount point atomically; this wasn't possible at all before. The new kernel interface, introduced as a new command to the nfssvc(2), has been designed to support this functionality, although it's not used yet due to the big changes required in the userland utility. I won't be working on this anytime soon, as I have other stuff to do before it (hence the entry I added to src/TODO).

If you want to know a bit more, the commit message contains a detailed list of all the changes.

PS: I have to thank Google's Summer of Code 2005 again. I couldn't have found this issue nor had enough knowledge to fix it if I hadn't been working on tmpfs during the summer.

Thursday, September 22, 2005

Quality of digital signals

I have seen many times people thinking that a signal has better quality if it's in digital form rather than in analog form, yet this is false. When you convert an analog signal to a digital format, you always lose quality because you are converting a continuous signal to a discrete one. Let's see a little example:

Suppose we want to digitize a 20KHz analog audio signal so that we can store it on disk. Following the Nyquist theorem, we must take a minimum of 40K samples per second; this guarantees that we can recreate the original signal in its exact form (i.e., maximum quality).

The lost of quality arises at this point: how can we represent each sample without losing precision (assuming, of course, that the sample is accurate)? Samples are measured in a continuous scale, but in order to store them we must round their value up or down to a discrete value we can represent (think about representing a real number, which leads to the same problem). We'd need infinite precision to store each sample in its exact form, which means infinite disk space. Not possible.

Note that the example above assumes a PCM modulation technique where each sample is codified and stored in a fixed amount of bits. As you can imagine, any other technique will suffer from this problem.

This post comes after today's SPD class, an optional course I'm taking this semester at university. It's all about the low level details of public data networks. Amazing stuff, IMHO ;-)

Sunday, September 18, 2005

Linker's link sets

I don't know about other linkers, but GNU ld provides a very useful feature: link sets. A link set is a list of symbols constructed during link time which can then be inspected in regular code. This is very interesting in situations when you want to initialize several subsystems from a centralized place but don't know which of these will be available; that is, you don't know which ones will be in the final binary.

To illustrate this feature, let's see how NetBSD manages file system initialization. In NetBSD, you can configure which file systems you want in your kernel binary and which not. Each file system has a vfs_init hook that must be called during system initialization but... how to do this in a clean and extensible way? For a moment, imagine we didn't have link sets. In this case, NetBSD would need a function similar to the following:

void
init_file_systems(void)
{
#ifdef UFS
ufs_init();
#endif

#ifdef MSDOSFS
msdosfs_init();
#endif

...

#ifdef TMPFS
tmpfs_init();
#endif
}

Isn't this ugly? Very much, although one might support the visual ugliness. However, the real problem is that, in order to add a new file system, you need to edit other parts of the system; this might seem acceptable in an open source project, but what if the initialization code was closed source? You couldn't add your file system. This clearly goes against extensibility.

Using link sets, this problem goes away: each independent module is able to add itself to a global list of symbols. Continuing our file systems example, take a look to the vfsinit function in kern/vfs_init.c. The first thing you see there is the declaration (__link_set_decl) of a new link set, called vfsops, that will hold pointers to struct vfsops objects. Then, the code uses the __link_set_foreach construction to iterate over the list, correctly initializing all the file systems. The following is an extract of this, with many details removed:

void
vfsinit(void)
{
__link_set_decl(vfsops, struct vfsops);
struct vfsops * const *vfsp; /* Iterator */

__link_set_foreach(vfsp, vfsops) {
(*vfsp)->vfs_init();
}
}

At last, how does a module add itself to the link set? Using the __link_set_add_data operation, which takes the name of the link set as its first argument and the object to be added as the second one. E.g.:

struct vfsops barops = {
...
ufs_init,
...
};
__link_set_add_data(vfsops, barops);

As you can see, using link sets the kernel does not need to know at all which file systems are compiled into it, yet the initialization code will be executed.

Sunday, September 11, 2005

Interface to change NFS exports

While adding NFS support to tmpfs, I found how NetBSD currently manages NFS exports from userland. The interface is, IMHO, scary. As NetBSD aims for clean code and design, it must be fixed.

See my mail to the tech-kern@ mailing list for more details on the issue and a preliminary patch.

Saturday, September 10, 2005

tmpfs: Project merged into NetBSD

After listening to many queries from developers asking when tmpfs could be integrated into NetBSD, I finally imported the code into the CVS repository. I'm really happy about this :-) Development will be simplified from now on and it will be a lot easier for interested parties to test the code. Please read the announcement for more information.

I'd like to comment now some of the improvements I've been doing during the past days, which mostly addressed optimization. I started by removing the storage of . and .. directory entries from directories, generating them on the fly when requested. This was done for three reasons. First, to remove redundancy: as . points to the directory itself, the entry can always be generated; similarly, .. can be generated from the pointer to the parent directory stored in the nodes. Secondly, to simplify the code: there were multiple assertions to ensure that these entries were correct and there was code to update them on file-system changes; this code was hard to understand and, as you can see, avoidable. Lastly, to reduce memory consumption: this removed around 1KB of storage from each directory (given to a change I've done recently, this gain is lower, because entries are now a lot smaller, but still this saves 40 bytes from each directory).

Then I fixed a long-standing and very important issue: the assignment of unique node numbers and generation numbers to nodes. Among other reasons, this was done to be NFS-friendly. See my previous post on NFS file handles for more details.

At last, I removed fixed-size string buffers from nodes and directory entries, replacing them with dynamically allocated memory from a string pool. Strings are now stored more efficiently and waste far less space. Note that, before this change, each directory entry occupied around 255 bytes and each node took more than 1KB. As an example, the simple creation of 10K empty files previously required around 20MB of memory, whereas it now needs 1.5MB. This should be reduced even more, but, as you can see, this is a lot better than before.

And of course, several bug fixes and miscellaneous improvements were done all around.

NFS file handles

NFS uses a structure called a file handle to uniquely identify exported files. When a client requests to access a file, the server constructs a file handle that identifies it; from this point on, this identifier will be used in all communications between the server and the client to access that specific file. (Look at the fs_vptofh and fs_fhtovp hooks in NetBSD's VFS layer to see how this mapping works.)

In order to identify a file univocally, you need three things: the file-system identifier (i.e., something unique to each mount point), the node number (e.g., the inode number in FFS terminology) and a generation number. It is clear why the first two are needed, but the generation number may be confusing. In order to explain this last concept, let's see an example:

Suppose you have a file using the 1234 inode of the root file-system; let's call this foo. Now, you delete foo and create a new file, called bar, which ends up reusing the 1234 inode. These two files are clearly different, but they could look the same from the point of view of a file handle if you used the file-system number and the inode number exclusively. Therefore, we need a third component, the generation number, that makes these two files different. This component is incremented each time an inode is reused for a different file.

So far, so good. But now we get to the ugly part: given a file-system number, a node number and a generation number, you can generate a file handle that refers to any file within the exported file-system. If these values were available to everybody, a malicious client could easily generate file handles to access files he is not allowed to touch. This is why the generation number is hidden to user space: if you apply stat(2) to a file, you'll always get a zero value in the st_gen field (some systems may return the correct number if you are root). Similarly, the getfh(2) system call is restricted to the super-user.

But can't a user guess the generation number? Yes, he can. This is why they are often generated based on a random seed, which makes the guessing difficult. See fsirand(8) for more information.

All this explains why there is no such thing as exporting a subtree in the NFS world. When you export a directory from a file-system, you are implicitly exporting the whole mount point because malicious file handles could be used to access files outside the subtree.

Thanks go to William Studenmund for explaining me this during tmpfs development.

Friday, September 09, 2005

FFS, LFS and MFS

The designers of the Fast File System (FFS) organized it in two different layers aiming for flexibility and code reuse; this happened in 4.4BSD (or maybe earlier, I don't know for sure). The upper layer communicates with the VFS and vnode code while the lower layer accesses the underlying devices and only receives requests from the upper layer. It goes like this:

VFS/vnode calls <-> FFS upper layer <-> FFS lower layer <-> backing device

The upper layer handles all logical details of a Unix file-system, abstracting, among many other things, inode representation and management, tree manipulations and verification of credentials. It also lays out data in different blocks and passes them to the lower layer.

The lower layer receives the blocks generated by the upper layer and lays them out on the physical device (e.g., a disk). It is also responsible of reading the blocks requested by the upper layer and transferring them to it.

Thanks to this design, three different lower layers were implemented, providing three file-systems:

  • Fast File System (FFS): It assumes that the average file size is small and that some files (those in the same directory) will be accessed all at once. Given these assumptions, it divides the disk surface in different cylinder groups and lays out blocks that may be accessed together in the same group. This reduces expensive seeks.
  • Log-structured File System (LFS): It assumes that the system bottleneck are writes to the disk, so it packs all writes in a concrete area of the disk, treating the overall space as an append-only log (no seeks required).
  • Memory File System (MFS): Reserves a contiguous region of memory and lays out blocks in it. This is why MFS is considered inefficient, because it is using the concepts of an on-disk file-system over memory.

The different layers are clearly visible in the source code. The src/sys/ufs/ufs directory contains the upper layer's code while src/sys/ufs/ffs, src/sys/ufs/lfs and src/sys/ufs/mfs contain the three respective lower layers.

Note that the above is an over-simplification of the Real Thing (TM). If you want detailed information, read Chapter 6 of The Design and Implementation of the 4.4 BSD Operating System.

Thursday, September 01, 2005

SoC: The end

So... the deadline for Google's Summer of Code 2005 program arrived some time between yesterday and today (don't know exactly due to timezones). The final results from my side: a functional memory file-system (not efficient yet) for NetBSD named tmpfs, as well as the beginnings of a book/article on file-system development under NetBSD.

As regards the file-system itself, its code can be found in the CVS repository. Despite it has some bugs and misfeatures, it is functional from the user's point of view. The file-system's code is around 4000 lines (plus 500 from the mount utility) and the regression test suite around 2000 (half of which are license texts). I know these numbers are low, but man... this is the hardest code I've ever written (mostly due to lack of documentation and having to reverse engineer existing stuff). I've also written a document describing tmpfs' internals (as said in the initial proposal), which is available in the form of a manual page in the repository and is around 700 lines long.

With respect to the documentation, it is not yet available online. I'm still discussing with my mentor (Bill Studenmund) how it shall be published and I don't like it in its current form to make it public (still has many mistakes and is heavily disorganized... and is already 2500 lines long). It looks like we both agree that it should be in the form of a (online) book, leaving room for future additions (e.g., UVM, device drivers, etc.). However, I will also try to write a little article explaining all the steps I followed to write tmpfs. Writing an independent article will be better because it'll let me focus on a single task and it'll be possible to mix information from several subsystems without making it look messy. But there is still a long way to go to have these done.

I would like to make it clear that despite the program has ended, I will continue working on all this stuff until it is ready to be integrated into NetBSD's source tree; all this work musn't have been in vain! Oh, and having been part of this program has been great. It has forced me to work on something I've always wanted to do (kernel development) but never found time to learn about it... and I've done paid development for the best free operating system! ;-) I have to say that I've learned a damn lot in these two months.

And today, just as a little prize for the program's completion, I've bought myself a copy of Doom 3 (because I decided to go legal a while ago, yay!) and this month's Dr. Dobb's Journal edition. This is the first time I buy this magazine. Why? They will be publishing a set of articles describing SoC projects, so I want to know about their style — which, as far as I've seen, is very technical, something that is nice. Furthermore, this edition has some articles that I found interesting when I first saw them in the web site; i.e., C++, STL, and custom pool allocators and C++ exceptions and the Linux kernel.