Pages

Monday, June 17, 2013

Readability: Mind your typos and grammar

I can't claim to be an expert writer — not in English, and not even in my native languages — but I do put a lot of attention to whatever I write: emails, presentations, letters and, of course, code. As it turns out, code has prose in it most of the time in the form of comments and documentation.

So, when you write any text in your code, engrave this in mind: any obvious typos and/or any grammar mistake very quickly denote sloppiness in what you wrote. If that is the impression the reader gets from your comments... why would he trust that you have been less sloppy when writing the code itself? Let me tell you that they won't, with good reason.

Some details to pay attention to that I keep encountering over and over again while reviewing code:

  • Write full sentences, always. That means using capitalization where due and sentence-ending periods.
  • Pay attention to proper names and capitalize them accordingly.
  • Commas are not sentence terminators. Use a semicolon or a period to connect full, standalone sentences.
  • Enable the spellchecker in your code editor, if available. This will catch an awful lot of typos.

But the most useful thing you can train yourself to do is: read whatever you have written twice at the very minimum. Most obvious errors will show up on those subsequent reads and you will have a chance to fix them. And yes, this applies to emails as well.

That's it for today on readability. Yes, this post is shorter than usual but a) it had to be said and b) there is not much more to add!

Saturday, June 15, 2013

Lutok 0.3 released

Lutok 0.3 was released yesterday evening. The packages in pkgsrc-current, Fedora 19 and Fedora rawhide have all been updated accordingly.

The major highlight of this new release is support for Lua 5.2 while retaining backwards compatibility with Lua 5.1. The incompatible changes between 5.1 and 5.2 only affected a small subset of the functionality in Lutok, which made this dual support possible.

For those that don't know what this project is about: Lutok is a lightweight C++ API for Lua. It is lightweight because it's an almost 1:1 mapping of the C API into C++. However, the major design criterion behind Lutok is to provide an interface that is fully C++ native and that is safe to use in the face of exceptions. For this reason, Lutok is not as performant as the native C library, which is OK for many use cases out there.

Be aware that Lutok 0.3 is backwards-incompatible with 0.2 at the API and ABI levels. This is just an unfortunate effect of Lua 5.1 and 5.2 being incompatible in the same way. The release notes has all details on this.

PS: Hey, I promised to update the blog on Mondays and Thursdays... but that does not mean I won't publish additional content any other day when the scheduled dates are inappropriate ;-)

Thursday, June 13, 2013

Readability: No abbreviations

When you are writing code, it is very tempting to shorten the names of functions and variables when the shortening seems blatantly obvious. All of us have, at some point, written calc_usage instead of calculate_usage; res instead of result; stmt instead of statement; ctx instead of context; etc. The thought goes "well, these abbreviations are obvious, and I'll be a much faster developer!"

Wrong. You might think you are faster at typing, but you don't write code in one go and never ever get back to it again. As we have already mentioned, code is usually only written once and is later read many times, possibly by people other than the original developer. At the very least, I'd expect you going over your code once before checking it into your repository. Spending the extra minute it takes to write words in full will benefit you and your readers.

Let me go out a bit on a tangent while being no expert in cognition. Our brain is trained to recognize words not only by reading letters left to right; we see the full word at once, and, unless it is a new word that we have never seen before, we recognize what it is without having to go over all its letters. If you truncate words or remove letters in an attempt to make them shorter, reading the word becomes non-obvious and slows you down. Or, rather, it will certainly slow down other readers of the code — particularly if their native language is not the same as yours.

Examples of bad names? They are everywhere: ffs_fragacct, inblk, siz, cgp, cnt, sump, forw... I wish I was making these up, but they all come from a single file I picked at random from the NetBSD tree. Can you tell what any of these names refer to, univocally?

Lastly, keep in mind that when your variable name is not descriptive enough, you will be tempted to add a comment to clarify its purpose. Don't. Fix the variable name to not need a comment in the first place!

So, my suggestion: never use abbreviations. Spell out words in full in all your identifier names, and make the names self-explanatory.

Monday, June 10, 2013

Readability: Blank lines matter

Vertical spacing is important for readability reasons: group together pieces of code that should not be split apart, and otherwise add blank lines among chunks of code that could be easily reordered and/or repurposed. That's a pretty loose suggestion though, so let's look at some specific situations in which you want to consider your vertical spacing practices (both adding and removing).

Give some air to long functions

Functions longer than a handful lines generally deserve some vertical spacing. A first guideline is to separate the return of the result value (along all of its boilerplate) from everything else that the function may be doing.

Cluster together computations or assignments that are related

When populating the contents of a data structure or when defining multiple variables that are supposed to be used together, do not add blank lines among these lines:

settings = FileOpenSettings()
settings.deadline = 30
settings.allow_failure = True

arguments = FileOpenArguments()
arguments.file_name = '/my/document'

result = server.FileOpen(settings, arguments)

The example above represents a fictitious RPC call to a remote server to open a file. The construction of the RPC settings and the arguments belong in different data types, so we construct each object in its own "code paragraph". Note that there are blank lines to separate the construction of each of these but there are no blank lines to separate the various assignments to the same data structure.

Surround conditionals and loops with one blank line, except for their "prologue" or "epilogue"

Every time I write a conditional or a loop, I start by adding one blank line before and after them. Most times, all is good and the code stays this way. However, there are times where some code needs to be put right at the beginning of the conditional or loop (the prologue), or right after (the epilogue). In those cases, these extra pieces of code are not split apart from the conditional or loop they relate to.

A common occurrence of this is when a loop has two exit conditions: one where an error has been detected and one where everything is OK. When the loop is terminated, you need to inspect which of the two conditions happened and act accordingly.

Take a look at this code, which implements a simple function to classify a set of students by the grade they belong to and to raise an error when one or more students are registered to an unknown grade:

def classify_by_grade(students, known_grades):
    students_by_grade = collections.defaultdict(set)

    unclassified_students = set()
    for student in students:
        if student.grade in known_grades:
            students_by_grade[student.grade].add(student)
        else:
            unclassified_students.add(student)
    if len(unclassified_students) == 0:
        raise UnknownGradesError(unclassified_students)

    return students_by_grade

Regarding spacing, we can see that there are three different sections in the code: the definition of the return value, the computation of the value, and the termination of the function. This is usually a good organization.

However, what I want you to pay attention to is the epilogue and prologue of the loop. Because we wanted to report all the students with an unknown grade (instead of just the first one), we had to accumulate these in an auxiliary variable unclassified_students. The existence of this variable, and the later inspection of it, is for the sole purpose of the loop; therefore, such code must be attached to the loop itself without any vertical spacing in between.

If I were to write this same code in a language with strict scoping rules, I'd even put the loop and the definition of unclassified_students in its own block so that the variable was unaccessible after we were done with it.

Group comments with the code they belong to.

Usually, comments are tied to a specific chunk of code. Separate the comment and such code from the rest with blank lines. For example:

test_person = Person(
    # The name of this test person has to be two words long
    # at least.
    name='John Smith',

    # We want this person to be an adult, so ensure the test age
    # is above 21 years old.  Granted, this is only true in some
    # countries but our code assumes 21.
    age=25,
)

Note the blank line right before the comment attached to the assignment to age; it is important. Otherwise, I'd assume that such assignment is somehow related to the assignment to name. Another example on these lines:

class Person(object):

    def __init__(self, name, age):
        # Initialize fields that are copies of the input
        #arguments.
        self.name = name
        self.age = age

        # Now initialize fields that are derived from the input
        # arguments.  Yes, yes, believe me when I say that this
        # is very wrong, but it serves to illustrate the
        # example.
        self.first_name = name.split(' ')[0]
        self.last_name = name.split(' ')[1:]

Thursday, June 06, 2013

Readability: Blocks and variable scoping

In a dynamically-typed language, it is common for the scoping semantics of a variable to be wider than a single code block. For example: in at least Python and the shell, it is the case that a variable defined anywhere within a function —even inside conditionals or loops— is reachable anywhere in the function from there on.

To illustrate what this means, consider this snippet in which we define a function to compute the CPU requirements needed in a database system to support a set of tables:

def cpu_requirements(database):
    disk_bytes, tables_count = calculate_summaries(database)
    if disk_bytes > 0:
        cpu = cpu_favoring_disk_bytes(disk_bytes)
    elif tables_count > 0:
        cpu = cpu_favoring_tables_count(tables_count)
    else:
        cpu = 0

    ... various tens of lines of code ...

    return cpu + SAFETY_MARGIN

The thing I want you to note in this code snippet is that we are defining the cpu variable in all code paths of the conditional and later using the computed value outside the conditional, possibly after tens of lines of code. Practically, there is nothing wrong with this as long as the code works as intended, but I personally find this style to be confusing: the code does not show the intent of the programmer regarding where a variable is going to be used.

As a guideline, define a variable in the outmost block where it is used. Visually, this means defining the variable at the leftmost indentation level in which it is going to be referenced later.

The code above would be rewritten as follows:

def cpu_requirements(tables_info):
    disk_bytes, tables_count = calculate_summaries(database)
    cpu = None
    if disk_bytes > 0:
        cpu = cpu_favoring_disk_bytes(disk_bytes)
    elif tables_count > 0:
        cpu = cpu_favoring_tables_count(tables_count)
    else:
        cpu = 0
    assert cpu is not None, 'cpu not defined in a code path'

    ... various tens of lines of code ...

    return cpu + SAFETY_MARGIN

There are two key ideas behind these tiny adjustments:

  • First, the scope of the cpu variable is explicitly declared to be outside of the conditional. This happens right before entering the alternative code paths, so there is a clear expectation that this variable will be used later on outside of the conditional statement.
  • And second, the expectation that the cpu variable has been assigned a value in all possible code paths is explicitly coded by an assertion. This is important. Consider that the code in each branch of the conditional could be multiple lines long, and the assignment of a value to our variable might be buried among those lines. With the assertion, we want to ensure that whatever happens to the contents of the conditional branches in future revisions does not mean that the initialization of the variable is lost (by mistake).

Everything mentioned here applies to loops and other higher-level constructs as well, particularly try/catch blocks. Consider this code:

def get_current_user_data():
    try:
        user_data = helper_module.get_user_data(os.getuid())
    except helper_module.UserQueryError, e:
        raise BackendError(e)

    try:
        group_data = helper_module.get_group_data(os.getgid())
    except helper_module.GroupQueryError, e:
        raise BackendError(e)

    return user_data, group_data

The code in this example performs two separate queries via a helper module, and these queries raise exceptions defined in the helper module. To be self-contained, our get_current_user_data() rewrites these exceptions as a generic exception defined in the current module. The issue, however, is that the variables defined within the try block are accessed later separately. I would instead do:

def get_current_user_data():
    user_data = None
    try:
        user_data = helper_module.get_user_data(os.getuid())
    except helper_module.UserQueryError, e:
        raise BackendError(e)
    assert user_data is not None

    group_data = None
    try:
        group_data = helper_module.get_group_data(os.getgid())
    except helper_module.GroupQueryError, e:
        raise BackendError(e)
    assert group_data is not None

    return user_data, group_data

Which is very similar to what we did above for conditionals.

To wrap everything up, keep these guidelines in mind:

  • Define variables in the outmost block where they are referenced.
  • If you don't have a good default value to which to initialize the variable to —because, for example, its value is computed in a conditional path— set it to None and later assert that it has been set to something different.

Monday, June 03, 2013

Readability: Series introduction

Dear readers,

This post is the beginning of a new series on the idioms and style that I use to keep code readable and obvious. I often get positive comments when undergoing peer reviews at Google and externally, so I am going to share such personal style in the hope that it might be useful to some of you. Explaining these ideas to coworkers when I review their code has proven to be useful in the long term.

The posts in this series will, in general, cover generic programming issues that are applicable to any language. That said, all of the examples will be in Python and coming up with such standalone code snippets has proven to be hard. While all of the snippets are fictitious and written for the sole purpose of this blog, they are all inspired in code I've had encountered in the past.

There is one little thing that I won't repeat in every article, but that little thing will govern the whole series: readability and simplicity trump performance and cleverness every single time (unless, of course, your code really, truly is performance critical — and most of the code out there just is not). Keep in mind that a piece of code will be written once but it will be read dozens or hundreds of times by yourself or other people. Also, there is the common saying that the reader needs to be a hundred times smarter than the programmer that originally wrote the code in order to understand what the code does... so keep the code dumb or you won't be able to parse what you wrote a few months down the road.

And, lastly, don't worry! Even if this blog has not received any love for the last 7 months, this introductory post will not be left alone without followup articles. I have already written a bunch of the articles in this series to prevent that from happening :-)

Stay tuned.

Monday, October 22, 2012

Setting up my old Mac Mini G4 as a development machine

I've spent quite a few time last week setting up my old Mac Mini G4 — a PPC 1.2GHz with 1GB of RAM running NetBSD/macppc current — as a "workstation" for the development of Kyua and other tools for NetBSD. Yes, this machine is very slow, but that's the whole point of the exercise I'm going to narrate below.

I recently got approval from the NetBSD core team to import Kyua into the NetBSD source tree and replace ATF with it... which is great, but when thinking about it objectively, I am reluctant to unnecessarily "punish" all NetBSD users by increasing the build time of the system significantly. Don't get me wrong: Kyua itself runs efficiently on old machines, but building the code — particularly the hundreds of tests that come with it — takes just too long. This slowdown is not too noticeable on a relatively-modern machine, but it's unacceptable on not-so-old machines.

Of course, I could proceed with the import right now (with the code disabled by default) and make it leaner later, but this would cause quite a first bad impression on our users. So it's better to delay the import a little bit until, at least, I have had a chance to simplify some core parts of the code. Mind you, this simplification work is already in progress and quite advanced: it consists on modularizing (as separate processes) some critical parts of the code and writing these with a much simpler style and in plain C.

But back to the original point of my post.

The first thing to mention about this experience is that with some effort and long waits, I've got a pretty decent setup for development even on this old machine. From time to time, I miss having a real Unix desktop at hand for development (no OS X, you are not one of those). The GUI behaves relatively well with a 1920x1200 display, running Fluxbox, traditional xterms, Mutt for GMail access and a bunch of other applications.

Unfortunately, too many things feel really sluggish. A few specific examples:

  • Firefox 16 is barely usable. I'm not sure there are many alternatives to decent web browsing for such an old non-Intel platform. Dillo is blazing fast and allows me to access online documentation and mailing list archives (just enough for development), but it's pretty much useless for any other purpose given the "Web 2.0".
  • Any operation that involves pkgsrc takes ages. Even when building the simplest packages, one can notice the system crawl through all of the pkgsrc infrastructure. Sometimes this is the fault of a bad algorithm; other times it's just sh(1) being the wrong tool for something as complex as pkgsrc internals.
  • Things like basic code editing in Emacs 24 are slow at responding to typing. Disabling font lock mode makes it feel fast again, but it's just surprising to see that even color coding is slow.
I still remember my old and trusty machine from 10 years ago (a Pentium II 233 MHz): with a similar setup, it was significantly snappier. Yes, software has evolved and these packages now have many more features... but really, does editing a text file have to be sluggish?

Leaving aside sluggishness, there is also the problem of instability. NetBSD/macppc is a tier 2 port, and things don't work as well as one would like. I personally would enjoy bringing this port to tier 1... but I currently lack the time (and basic knowledge of the architecture) to do so :-/

Anyway, the result of this exercise: the new code I'm writing to modularize Kyua builds damn fast in this machine, and keeping it this way is the whole point of having such an old machine as a build environment. So I'll try to keep using it to develop these new components of Kyua.