Thursday, August 24, 2006

A split function in Haskell

Splitting a string into parts based on a token delimiter is a very common operation in some problem domains. Languages such as Perl or Java provide a split function in their standard library to execute this algorithm, yet I'm often surprised to see how many languages do not have one. As far as I can tell neither C++ nor Haskell have it so I have coded such a function in the past multiple times in both languages. (This is not exactly true: Haskell has the words function which splits a string by whitespace characters. Nevertheless I didn't know this when I wrote my custom implementation.)

When I implemented a custom split function in Haskell I was really amazed to see how easy and clean the resulting code was. I'm sure there is some better and even cleaner way to write it because I'm still a Haskell newbie! Here is it:
split :: String -> Char -> [String]
split [] delim = [""]
split (c:cs) delim
| c == delim = "" : rest
| otherwise = (c : head rest) : tail rest
where
rest = split cs delim
The above code starts by declaring the function's type; this is optional because Haskell's type system is able to automatically deduce it. It then uses pattern matching to specify the algorithm's base and recursive cases. At last, the recursive case is defined by parts, just as you do in mathematics. Oh, and why recursivity? Because iteration does not exist in functional programming in the well-known sense of imperative languages. Also note the lack of variables (except for the input ones) and that everything is an evaluable expression.

Let's now compare the above code with two implementations in C++. A first approach to the problem following common imperative programming thinking results in an iterative algorithm:
std::deque< std::string >
split_iterative(const std::string& str, char delim)
{
std::deque< std::string > parts;

std::string word;
for (std::string::const_iterator iter = str.begin();
iter != str.end(); iter++) {
if (*iter == delim) {
parts.push_back(word);
word.clear();
} else
word += *iter;
}
parts.push_back(word);

return parts;
}
This is certainly uglier and much more difficult to prove right; iteration is a complex concept in that sense. In this code we have variables that act as acumulators, temporary objects, commands, etc. Be glad that I used C++ and not C to take advantage of STL containers.

OK, to be fair the code should be implemented in a recursive way to be really comparable to the Haskell sample function. Let's attempt it:
std::deque< std::string >
split_recursive(const std::string& str, char delim)
{
std::deque< std::string > parts;

if (!str.empty()) {
std::string str2 = str;
parts = split_recursive(str2.erase(0, 1), delim);

if (str[0] == delim)
parts.push_front("");
else
parts[0] = str[0] + parts[0];
} else
parts.push_front("");

return parts;
}
This split_recursive function follows the same algorithm as the split written in Haskell. I find that it is still harder to read and more delicate (I had some segmentation fault s until I got it right).

Of course Haskell is not appropriate for everything (which is true for every language out there). I have yet to write a big and useful program in Haskell to really see its power and to be able to relly compare it to other languages. All I can do at the moment is to compare trivial stuff as the above.