Chapter 8. Case study: regular expressions and file name matching

Table of Contents

File name matching
Regular expressions in Haskell
The many types of result
Mixing and matching string types
Other things you should know
Translating a glob pattern into a regular expression
Exercises
An important aside: writing lazy functions
Making use of our pattern matcher
Exercises
Handling errors through API design
Exercises
Putting our code to work
Exercises
Wrapping it up

Working with text strings is a fundamental tool in the programmer's toolbox. We've already been introduced to bytestrings and ropes as efficient ways to work with large text strings. Throughout this book, we'll be returning to string manipulation over and again, to show off the breadth of different techniques and libraries we can use to tackle a variety of problems that happen to involve working with strings.

In this chapter, we'll develop a library that can match file names against “glob-style” file name patterns. These patterns are a common feature of command shells. To begin, we'll introduce the pattern language that we'll be working with. Our library will translate these patterns into regular expressions, so we'll need to understand how to use regular expressions in Haskell. Next, we'll talk about how to use Haskell's standard directory listing functions with this library. Along the way, we'll talk a little about writing portable code, good API design, and constructing code from small building blocks.

File name matching

Many “system-oriented” programming languages provide library routines that let us match a file name against a pattern, or that will give a list of files that match the pattern. (In other languages, this function is often named fnmatch.) Although Haskell's standard library generally has good “system programming” facilities, it doesn't provide these kinds of pattern matching functions. We'll take this as an opportunity to develop our own.

The kinds of patterns we'll be dealing with are commonly referred to as “glob patterns” (the term we'll use), “wildcard patterns”, or “shell-style patterns”. These patterns have just a few simple rules. You probably already know them, but we'll quickly recap here.

  • Matching a string against a pattern starts at the beginning of the string, and finishes at the end.

  • Most literal characters match themselves. For example, the text foo in a pattern will match foo, and only foo, in an input string.

  • The * (asterisk) character means “match anything”; it will match any text, including the empty string.

  • The ? (question mark) character means “match any single character”.

  • A [ (open square bracket) character begins a character class, which is ended by a ]. Its meaning is “match any character in this class”. A character class can be negated by following the opening [ with a !, so that it means “match any character not in this class”.

    As a shorthand, a character followed by a - (dash), followed by another character, denotes a range: “match any character within this set”.

    Character classes have an added subtlety; they can't be empty. The first character after the opening [ or [! is part of the class, so we can write a class containing the ] character as []aeiou].

While Haskell doesn't provide a way to match glob patterns among its standard libraries, it has excellent regular expression matching facilities. Glob patterns are nothing more than cut-down regular expressions with slightly different syntax. It's easy to convert glob patterns into regular expressions, so that's what we'll do. But in order to do so, we must first understand how to use regular expressions in Haskell.

Regular expressions in Haskell

In this section, I'll be assuming that you are already familiar with regular expressions (which I'll abbreviate as regexps from here on) by way of some other language, such as Python, Perl, or Java. Rather than introduce them as something new, I'll be focusing on what's different about regexp handling in Haskell, compared to other languages. Haskell's regular expression matching libraries are a lot more expressive than those of other languages, so there's plenty to talk about.

To begin our exploration of the regexp libraries, the only module we'll need to work with is Text.Regex.Posix. As usual, the most convenient way to explore this module is by interacting with it via ghci.

ghci> :module +Text.Regex.Posix
ghci> :module +Text.Regex.Posix

The only function that we're likely to need for normal use is the regexp matching function, an infix operator named =~ (borrowed from Perl. The first hurdle to overcome is that Haskell's regexp libraries make heavy use of polymorphism. As a result, the type signature of the =~ operator is a bit difficult to understand. Let's drop into ghci and see what it looks like.

ghci> :type (=~)
(=~) :: (Text.Regex.Base.RegexLike.RegexContext Regex source1 target,
 Text.Regex.Base.RegexLike.RegexMaker Regex
				      CompOption
				      ExecOption
				      source) =>
source1 -> source -> target
ghci> :type (=~)
(=~) :: (Text.Regex.Base.RegexLike.RegexContext Regex source1 target,
 Text.Regex.Base.RegexLike.RegexMaker Regex
				      CompOption
				      ExecOption
				      source) =>
source1 -> source -> target

Ouch! Rather than pick apart what this convoluted definition means, let's take a shortcut. The =~ operator uses typeclasses for both of its arguments, and also for its return type. The first argument (on the left of the =~) is the text to match; the second (on the right) is the regular expression to match against. We can pass either a String or a Data.ByteString as either argument.

The many types of result

The =~ operator is polymorphic in its return type, so the Haskell compiler needs some way to know what type of result we would like. In real code, it may be able to infer the right type, due to the way we subsequently use the result. But such cues are often lacking when we're exploring with ghci. If we omit a specific type for the result, we'll get an intimidating-looking error message. Here's what ghci tells us if we try to do a regexp match but omit the return type.

ghci> "my left foot" =~ "foo"

<interactive>:1:0:
    No instance for (Text.Regex.Base.RegexLike.RegexContext Regex
							    [Char]
							    target)
      arising from use of `=~' at <interactive>:1:0-22
    Possible fix:
      add an instance declaration for
      (Text.Regex.Base.RegexLike.RegexContext Regex [Char] target)
    In the expression: "my left foot" =~ "foo"
    In the definition of `it': it = "my left foot" =~ "foo"
ghci> "my left foot" =~ "foo"

<interactive>:1:0:
    No instance for (Text.Regex.Base.RegexLike.RegexContext Regex
							    [Char]
							    target)
      arising from use of `=~' at <interactive>:1:0-22
    Possible fix:
      add an instance declaration for
      (Text.Regex.Base.RegexLike.RegexContext Regex [Char] target)
    In the expression: "my left foot" =~ "foo"
    In the definition of `it': it = "my left foot" =~ "foo"

That's quite an impenetrable error message, but what it means is that ghci can't infer what type of result to give back to us. (This is the same kind of error that a compiler would print if it couldn't infer the correct type in a real program.) Here's the key to dealing with this error. When ghci can't infer the target type, we tell it what we'd like the type to be. If we want an expression of type Bool, we'll get a pass/fail result.

ghci> "my left foot" =~ "foo" :: Bool
Loading package regex-base-0.71 ... linking ... done.
Loading package regex-posix-0.71 ... linking ... done.
True
ghci> "your right hand" =~ "bar" :: Bool
False
ghci> "your right hand" =~ "(hand|foot)" :: Bool
True
ghci> "my left foot" =~ "foo" :: Bool
Loading package regex-base-0.71 ... linking ... done.
Loading package regex-posix-0.71 ... linking ... done.
True
ghci> "your right hand" =~ "bar" :: Bool
False
ghci> "your right hand" =~ "(hand|foot)" :: Bool
True

In the bowels of the base regexp library, there's a typeclass named RegexContext that describes how a target type should behave; the base library defines many instances of this typeclass for us. The Bool type is an instance of this typeclass, so we get back a usable result. Another such instance is Int, which gives us a count of the number of times the regexp matches.

ghci> "a star called henry" =~ "planet" :: Int
0
ghci> "honorificabilitudinitatibus" =~ "[aeiou]" :: Int
13
ghci> "a star called henry" =~ "planet" :: Int
0
ghci> "honorificabilitudinitatibus" =~ "[aeiou]" :: Int
13

If we ask for a String result, we'll get the first substring that matches, or an empty string if nothing matches.

ghci> "I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: String
"ii"
ghci> "hi ludi, F. Baconis nati, tuiti orbi" =~ "Shakespeare" :: String
""
ghci> "I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: String
"ii"
ghci> "hi ludi, F. Baconis nati, tuiti orbi" =~ "Shakespeare" :: String
""

Another valid type of result is [String], which returns a list of all matching strings.

ghci> "I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: [String]
["ii","uu"]
ghci> "hi ludi, F. Baconis nati, tuiti orbi" =~ "Shakespeare" :: [String]
[]
ghci> "I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: [String]
["ii","uu"]
ghci> "hi ludi, F. Baconis nati, tuiti orbi" =~ "Shakespeare" :: [String]
[]
[Note]Note

If you want a result that's a plain String, beware. Since (=~) returns an empty string to signify “no match”, this poses an obvious difficulty if the empty string could also be a valid match for the regexp. If such a case arises, you should use a different return type instead.

FIXME: Find some place to mention the monadic =~~ operator.

That's about it for “simple” result types, but we're not by any means finished. Before we continue, let's use a single pattern for our remaining examples. We can define this pattern as a variable in ghci, to save a little typing.

ghci> let pat = "(foo[a-z]*bar|quux)"
ghci> let pat = "(foo[a-z]*bar|quux)"

We can obtain quite a lot of information about the context in which a match occurs. If we ask for a three-element tuple, we'll get back the text before the first match, the text of that match, and the text that follows it.

ghci> "before foodiebar after" =~ pat :: (String,String,String)
("before ","foodiebar"," after")
ghci> "before foodiebar after" =~ pat :: (String,String,String)
("before ","foodiebar"," after")

If the match fails, the entire text is returned as the “before” element of the tuple, with the other two elements left empty.

ghci> "no match here" =~ pat :: (String,String,String)
("no match here","","")
ghci> "no match here" =~ pat :: (String,String,String)
("no match here","","")

Asking for a four-element tuple gives us a fourth element that's a list of all groups in the pattern that matched.

ghci> "before foodiebar after" =~ pat :: (String,String,String,[String])
("before ","foodiebar"," after",["foodiebar"])
ghci> "before foodiebar after" =~ pat :: (String,String,String,[String])
("before ","foodiebar"," after",["foodiebar"])

We can get numeric information about matches, too. A pair of Ints gives us the starting offset of the first match, and its length. If we ask for a list of these pairs, we'll get this information for all matches.

ghci> "before foodiebar after" =~ pat :: (Int,Int)
(7,9)
ghci> "i foobarbar a quux" =~ pat :: [(Int,Int)]
[(2,9),(14,4)]
ghci> "before foodiebar after" =~ pat :: (Int,Int)
(7,9)
ghci> "i foobarbar a quux" =~ pat :: [(Int,Int)]
[(2,9),(14,4)]

A failed match is represented by the value -1 as the first element of the tuple (the match offset) if we've asked for a single tuple, or an empty list if we've asked for a list of tuples.

ghci> "eleemosynary" =~ pat :: (Int,Int)
(-1,0)
ghci> "mondegreen" =~ pat :: [(Int,Int)]
[]
ghci> "eleemosynary" =~ pat :: (Int,Int)
(-1,0)
ghci> "mondegreen" =~ pat :: [(Int,Int)]
[]

Believe it or not, this is not a comprehensive list of built-in instances of the RegexContext typeclass. For a complete list, see the documentation for the Text.Regex.Base.Context module.

Mixing and matching string types

As we noted earlier, the =~ operator uses typeclasses for its argument types and its return type. Any combination of String and ByteString will work for the types of both the regular expression and the text to match against. Recall, from our coverage of bytestrings in chapter FIXME, that the pack function takes a String and returns its corresponding ByteString.

ghci> :module +Data.ByteString.Char8
ghci> :type pack "foo"
pack "foo" :: ByteString
ghci> :module +Data.ByteString.Char8
ghci> :type pack "foo"
pack "foo" :: ByteString

We can then try using different combinations of String and ByteString.

ghci> pack "foo" =~ "bar" :: Bool
False
ghci> "foo" =~ pack "bar" :: Int
0
ghci> pack "foo" =~ pack "o" :: [(Int, Int)]
[(1,1),(2,1)]
ghci> pack "foo" =~ "bar" :: Bool
False
ghci> "foo" =~ pack "bar" :: Int
0
ghci> pack "foo" =~ pack "o" :: [(Int, Int)]
[(1,1),(2,1)]

However, we need to be aware that if we want a string value in the result of a match, the text we're matching against must be the same type of string. Let's see what this means in practice.

ghci> pack "good food" =~ ".ood" :: [ByteString]
["good","food"]
ghci> pack "good food" =~ ".ood" :: [ByteString]
["good","food"]

In the above example, we've used the pack to turn a String into a ByteString. The typechecker accepts this because ByteString appears in the result type. But if we try getting a String out, that won't work.

ghci> "good food" =~ ".ood" :: [ByteString]

<interactive>:1:0:
    No instance for (Text.Regex.Base.RegexLike.RegexContext Regex
							    [Char]
							    [ByteString])
      arising from use of `=~' at <interactive>:1:0-20
    Possible fix:
      add an instance declaration for
      (Text.Regex.Base.RegexLike.RegexContext Regex [Char] [ByteString])
    In the expression: "good food" =~ ".ood"
    In the expression: "good food" =~ ".ood" :: [ByteString]
    In the definition of `it':
	it = "good food" =~ ".ood" :: [ByteString]
ghci> "good food" =~ ".ood" :: [ByteString]

<interactive>:1:0:
    No instance for (Text.Regex.Base.RegexLike.RegexContext Regex
							    [Char]
							    [ByteString])
      arising from use of `=~' at <interactive>:1:0-20
    Possible fix:
      add an instance declaration for
      (Text.Regex.Base.RegexLike.RegexContext Regex [Char] [ByteString])
    In the expression: "good food" =~ ".ood"
    In the expression: "good food" =~ ".ood" :: [ByteString]
    In the definition of `it':
	it = "good food" =~ ".ood" :: [ByteString]

And ouch! That's quite an error message. Don't let its verbosity confound you; we can easily fix this problem by making the string types of the left hand side and the result match once again.

ghci> "good food" =~ ".ood" :: [String]
["good","food"]
ghci> "good food" =~ ".ood" :: [String]
["good","food"]

This restriction does not apply to the type of the regexp we're matching against. It can be either a String or ByteString, unconstrained by the other types in use.

Other things you should know

When you look through Haskell library documentation, you'll see several regexp-related modules. The modules under Text.Regex.Base define the common API adhered to by all of the other regexp modules. It's possible to have multiple implementations of the regexp API installed at one time. At the time of writing, GHC is bundled with one implementation, Text.Regex.Posix. As its name suggests, this package provides POSIX regexp semantics.

[Note]Note

If you're coming to Haskell from a language like Perl, Python, or Java, and you've used regular expressions in one of those languages, you should be aware that the POSIX regexps handled by the Text.Regex.Posix module are different in some significant ways from Perl-style regexps. Here are a few of the more notable differences.

Perl regexp engines do left-biased matching when matching alternatives, whereas POSIX engines choose the greediest match. What this means is that given a regexp of (foo|fo*) and a text string of foooooo, a Perl-style engine will give a match of foo (the leftmost match), while a POSIX engine will match the entire string (the greediest match).

POSIX regexps have less uniform syntax than Perl-style regexps. They also lack a number of capabilities provided by Perl-style regexps, such as zero-width assertions and control over greedy matching.

Other Haskell regexp packages are available for download on the Internet. Some provide faster engines than the current POSIX engine; others provide Perl-style matching capabilities. All follow the standard API that we have covered in this section.

Translating a glob pattern into a regular expression

Now that we've seen the myriad of ways to match text against regular expressions, let's turn our attention back to glob patterns. We want to write a function that will take a glob pattern and return its representation as a regular expression. Both glob patterns and regexps are text strings, so the type that our function ought to have seems clear.

globToRegex :: String -> StringglobToRegex :: String -> String

We start our definition of the globToRegex function by recalling that a text string must match a glob pattern. Before we attempt to convert any part of the glob pattern, we need to have a “rooted” regular expression.

globToRegex cs = '^' : globToRegex' csglobToRegex cs = '^' : globToRegex' cs

Recall that the String is just a synonym for [Char], a list of Chars. The : operator puts a value (the ^ character in this case) onto the front of a list, where the list is the value returned by the yet-to-be-seen globToRegex' function.

That single-quote character in the name globToRegex'is not a typo, by the way; Haskell is unusual in allowing single quotes as parts of identifiers. A single quote is most likely to appear at the end of a name, where it's often pronounced “prime”. (We could, if we wanted to, name a function head'n'tail'n'such, but that's pretty unusual.)

[Note]Note

Haskell does not require that a value or function be declared or defined in a source file before it's used. It's perfectly normal for a definition to come after the first place it's used. The Haskell compiler doesn't care about ordering at this level. This grants us the flexibility to structure your code in the manner that makes most logical sense to us, as opposed to a way that makes the compiler writer's life easiest.

Haskell module writers often use this flexibility to put “more important” code earlier in a source file, relegating “plumbing” to later. In fact, that's exactly how we're presenting the globToRegex function and its helpers here.

With the regular expression rooted, the globToRegex' function will do the bulk of the translation work. We'll use the convenience of Haskell's pattern matching to enumerate each of the cases we'll need to cover.

globToRegex' :: String -> String
globToRegex' "" = "$"
globToRegex' ('*':cs) = ".*" ++ globToRegex' csglobToRegex' :: String -> String
globToRegex' "" = "$"
globToRegex' ('*':cs) = ".*" ++ globToRegex' cs

We now have a very minimal glob translator. Our first clause stipulates that if we hit the end of our glob pattern (by which time we'll be looking at the empty string), we return $, the regular expression symbol for “match end-of-line”. The second gets us to substitute the string .* every time we see a * in our input list. And the third passes every other character through, unaltered.

We can immediately save our new source file (let's call it GlobRegexTiny.hs and start playing with it in ghci. This is a great way to do exploratory programming: write a little code, load it into the interpreter, and see what happens.

ghci> :load glob-to-regexp/GlobRegexTiny.hs
[1 of 1] Compiling Main             ( glob-to-regexp/GlobRegexTiny.hs, interpreted )
Ok, modules loaded: Main.
ghci> globToRegex "*"
"^.*$"
ghci> globToRegex ""
"^$"
ghci> globToRegex "**"
"^.*.*$"
ghci> :load glob-to-regexp/GlobRegexTiny.hs
[1 of 1] Compiling Main             ( glob-to-regexp/GlobRegexTiny.hs, interpreted )
Ok, modules loaded: Main.
ghci> globToRegex "*"
"^.*$"
ghci> globToRegex ""
"^$"
ghci> globToRegex "**"
"^.*.*$"

These few lines of interaction might look trivial, but we have received immediate feedback on two fronts: our code passes the daunting scrutiny of Haskell's typechecker, and it produces sensible-looking results. Monkeying around with code in the interpreter early and often is consistently worthwhile.

The remaining cases for a complete definition of globToRegex' are easy to enumerate. First, we get ? out of the way.

globToRegex' ('?':cs) = '.' : globToRegex' csglobToRegex' ('?':cs) = '.' : globToRegex' cs

Should we revisit the different types of the (++) and (:) operators here? Newcomers often get confused over which to use.

More interesting is how we handle character classes.

globToRegex' ('[':'!':c:cs) = "[^" ++ c : charClass cs
globToRegex' ('[':c:cs) = '[' : c : charClass cs
globToRegex' ('[':_) = error "unterminated character class"globToRegex' ('[':'!':c:cs) = "[^" ++ c : charClass cs
globToRegex' ('[':c:cs) = '[' : c : charClass cs
globToRegex' ('[':_) = error "unterminated character class"

We take advantage of two behaviours of Haskell's pattern matcher here. The first is that it will match patterns in the order in which we declare them, so we place the most specific patterns first, and the least specific last. Secondly, there's no limit to the “depth” of a pattern: we can “peek” forwards into a list as far as we need to. (This isn't limited to simple lists, either. We can use the same capability to look inside nested structures.)

Here, the first pattern matches on three consecutive items of its input to ensure that a “negative” character class cannot be empty. The second pattern matches on only two items, so ensure that a “positive” character class cannot be empty. The final clause can only match if it's given a syntactically invalid character class.

Finally, we may need to escape some characters before we return them.

globToRegex' (c:cs) = escape c ++ globToRegex' csglobToRegex' (c:cs) = escape c ++ globToRegex' cs

The escape function ensures that the regexp engine will not interpret certain characters as pieces of regular expression syntax.

escape :: Char -> String
escape c
    | c `elem` regexChars = '\\' : [c]
    | otherwise = [c]
    where regexChars = "\\+()^$.{}]|"escape :: Char -> String
escape c
    | c `elem` regexChars = '\\' : [c]
    | otherwise = [c]
    where regexChars = "\\+()^$.{}]|"

The charClass helper function does nothing more than ensure that a character class is correctly terminated. It passes its input through unmolested until it hits a ], when it hands control back to globToRegex'.

charClass :: String -> String
charClass (']':cs) = ']' : globToRegex' cs
charClass (c:cs) = c: charClass cs
charClass [] = error "unterminated character class"charClass :: String -> String
charClass (']':cs) = ']' : globToRegex' cs
charClass (c:cs) = c: charClass cs
charClass [] = error "unterminated character class"

Now that we've finished defining globToRegex and its helpers, let's load it into ghci and try it out.

ghci> :load glob-to-regexp/GlobRegex.hs
[1 of 1] Compiling GlobRegex        ( glob-to-regexp/GlobRegex.hs, interpreted )
Ok, modules loaded: GlobRegex.
ghci> :module +Text.Regex.Posix
ghci> globToRegex "f??.c"
Loading package regex-base-0.71 ... linking ... done.
Loading package regex-posix-0.71 ... linking ... done.
"^f..\\.c$"
ghci> :load glob-to-regexp/GlobRegex.hs
[1 of 1] Compiling GlobRegex        ( glob-to-regexp/GlobRegex.hs, interpreted )
Ok, modules loaded: GlobRegex.
ghci> :module +Text.Regex.Posix
ghci> globToRegex "f??.c"
Loading package regex-base-0.71 ... linking ... done.
Loading package regex-posix-0.71 ... linking ... done.
"^f..\\.c$"

Sure enough, that looks like a reasonable regexp. Can we use it to match against a string?

ghci> "foo.c" =~ globToRegex "f??.c" :: Bool
True
ghci> "test.c" =~ globToRegex "t[ea]s*" :: Bool
True
ghci> "taste.txt" =~ globToRegex "t[ea]s*" :: Bool
True
ghci> "foo.c" =~ globToRegex "f??.c" :: Bool
True
ghci> "test.c" =~ globToRegex "t[ea]s*" :: Bool
True
ghci> "taste.txt" =~ globToRegex "t[ea]s*" :: Bool
True

It works! Now let's play around a little with ghci. We can create a temporary definition for fnmatch and try it out. (This is another example of how ghci is great for exploratory programming. We're not limited to defining temporary variables; we can also introduce temporary functions when we need to.)

ghci> let fnmatch pat name = name =~ globToRegex pat :: Bool
ghci> :type fnmatch
fnmatch :: (Text.Regex.Base.RegexLike.RegexContext Regex source1 Bool) =>
String -> source1 -> Bool
ghci> fnmatch "myname" "d*"
False
ghci> let fnmatch pat name = name =~ globToRegex pat :: Bool
ghci> :type fnmatch
fnmatch :: (Text.Regex.Base.RegexLike.RegexContext Regex source1 Bool) =>
String -> source1 -> Bool
ghci> fnmatch "myname" "d*"
False

The name “fnmatch” doesn't really have the “Haskell nature”, though. The typical Haskell style is for functions to have descriptive, “camel cased” names. Camel casing name smashes words together, capitalising each word in the resulting name; the name comes from the “humps” introduced by the capital letters. In our library, we'll give this function the name matchesGlob.

matchesGlob :: FilePath -> String -> Bool
name `matchesGlob` pat = name =~ globToRegex patmatchesGlob :: FilePath -> String -> Bool
name `matchesGlob` pat = name =~ globToRegex pat
[Note]Note

The choice of naming conventions and coding styles is a perennial source of heated discussion, perhaps because the stakes are so low that everyone feels entitled to an opinion. When we introduce style-related notions like camel cased identifiers, we're showing you what Haskell convention looks like. We're not making a statement about the relative merits of different people's preferences. All we do claim is that it's better to fit in with an existing “house style” than to forge off in a novel stylistic direction.

Exercises

1.

Use ghci to explore what happens if you pass a malformed pattern, such as [, to globToRegex. Write a small function that calls globToRegex, and pass it a malformed pattern. What happens?

2.

While filesystems on Unix are usually sensitive to case (e.g. “G” vs. “g”) in file names, Windows filesystems are not. Add a parameter to the globToRegex and matchesGlob functions that allows control over case sensitive matching.

An important aside: writing lazy functions

In an imperative language, the globToRegex' function is one that we'd usually express as a loop. For example, Python's standard fnmatch module includes a function named translate that does exactly the same job as our globToRegex function. It's written as a loop.

If you've been exposed to functional programming through a language such as Scheme or ML, you've probably had drilled into your head the notion that “the way to emulate a loop is via tail recursion”. A function usually needs a little local scratch storage when it executes. A tail recursive function must either return a plain value, or make a recursive call as the last thing it does (these kinds of call are called “tail calls”). Since a function making a tail call can by definition never use any of its local scratch storage again, a language implementation can reuse that space when it makes the tail call. This means that a series of tail calls can execute in constant space, just as we take for granted with a loop.

Looking at the globToRegex' function, we can see that it is not tail recursive. To see why, let's examine its final clause again (several of its other clauses are structured similarly).

globToRegex' (c:cs) = escape c ++ globToRegex' csglobToRegex' (c:cs) = escape c ++ globToRegex' cs

It calls itself recursively, but this isn't the last thing the function does. The result of the recursive call is a parameter to the (++) function.

So what's going on here? Why are we not writing a tail recursive definition for this function? The answer lies with Haskell's non-strict evaluation strategy. Before we start talking about that, let's quickly talk about why, in a traditional language, we'd be trying to avoid this kind of recursive definition.

Returning to the clause above, globToRegex' computes a little bit of its result, then calls itself recursively, then returns the complete result. In a traditional language, each recursive call is going to require the allocation of temporary scratch memory until it returns.

Image: stack allocation for a recursive call to globToRegex'.

Compound these scratch allocations over a large number of recursive calls, and the amount of space we use while processing grows linearly with the size of the list we must process.

Image: stack allocation for a pile of recursive calls.

For a problem like glob-to-regexp conversion, where we'll always be dealing with very small amounts of data, this overhead is insignificant. But if we had a hundred million elements to process, and used plain recursion rather than tail recursion or a loop to process them, we'd have a severe problem.

Haskell neatly defangs this problem by deferring the evaluation of an expression until its result is needed. To understand how this deferred evaluation works, let's walk through what happens when the final clause of the globToRegex' is called. (As you read, bear in mind that this is a simplified description.)

In order to determine that this clause must be called, the pattern matcher must inspect a little bit of the list it was passed. The pattern requires it to do nothing more than determine whether it has been given an empty list (in which case the clause will not be evaluated), or a non-empty list (the case we're interested in). Once the pattern matcher has established that the list isn't empty, it goes no deeper. It doesn't look inside c, the head of the list. It doesn't follow cs, the tail.

We have now decided to evaluate the right hand side of the clause. The first expression we must evaluate is the call to the list append function, (++). (Because we don't need the results of escape c or globToRegex' cs yet, we suspend the evaluation of those functions, to evaluate when we'll need them.) So let's remind ourselves what (++) looks like:

(++)         :: [a] -> [a] -> [a]
[]     ++ ys = ys
(x:xs) ++ ys = x : (xs ++ ys)(++)         :: [a] -> [a] -> [a]
[]     ++ ys = ys
(x:xs) ++ ys = x : (xs ++ ys)

As the argument to (++) is not an empty list, the second clause matches, so we must evaluate x. This requires that we evaluate escape c, which we had earlier deferred. Let's look again at the definition of escape.

escape :: Char -> String
escape c
    | c `elem` regexChars = '\\' : [c]
    | otherwise = [c]
    where regexChars = "\\+()^$.{}]|"escape :: Char -> String
escape c
    | c `elem` regexChars = '\\' : [c]
    | otherwise = [c]
    where regexChars = "\\+()^$.{}]|"

This will return a list whose first element is either the \ character or the character it was passed. Once we've constructed the first element of this list, we'll defer constructing the rest of it until our caller needs it. We suspend the evaluation of the rest of escape so that we can pass this datum back to our caller. Haskell implementations do this by saving away enough information to resume the evaluation later; this saved package of information is called a “thunk”.

Not only is the evaluation of escape suspended; so too are the evaluation of its callers, (++) and globToRegex', and so on up the call chain until some expression needs to inspect the value.

What would happen if the caller of globToRegex' was for some reason the function null? null only cares whether its argument is an empty list: once it finds this out, it's never going to look at the remainder of the result of globToRegex'. None of the work we deferred in order to get a partial result back to null will ever actually occur.

Remarkably, given this kind of evaluation strategy, our recursive definition of globToRegex' will execute in constant space (more or less). To see why, look at the definition of (++); although it's simpler, it has a similar recursive structure.

If we put on our implementor's hats and think about how we might suspend the evaluation of (++) in a thunk, all we need to store are the current heads of the left and right lists; the result of the next evaluation step depends on nothing more. On each resumption of the thunk, we could simply peel off the head of the left list and update the thunk with the new head, repeating until we reach its end. Therefore, we can execute in constant space.

Making use of our pattern matcher

It's all very well to have a function that can match glob patterns, but we'd like to be able to put this to practical use. On Unix-like systems, the glob function returns the names of all files and directories that match a given glob pattern. Let's build a similar function in Haskell. Following the Haskell norm of descriptive naming, we'll call our function namesMatching.

module Glob (namesMatching) wheremodule Glob (namesMatching) where

This function will obviously have to manipulate filesystem paths a lot, splicing and joining them as it goes. We'll need to use a few previously unfamiliar modules along the way.

The System.Directory module provides standard functions for working with directories and their contents.

import System.Directory (doesDirectoryExist, doesFileExist,
                         getCurrentDirectory, getDirectoryContents)import System.Directory (doesDirectoryExist, doesFileExist,
                         getCurrentDirectory, getDirectoryContents)

The System.FilePath module abstracts the details of an operating system's path name conventions. Using this together with the System.Directory module, we can write a portable namesMatching function that will run on both Unix-like and Windows systems.

import System.FilePath (dropTrailingPathSeparator, splitFileName, (</>))import System.FilePath (dropTrailingPathSeparator, splitFileName, (</>))

Finally, we'll be emulating a “for” loop; getting our first taste of exception handline in Haskell; and of course using the matchesGlob function we just wrote.

import Control.Exception (handle)
import Control.Monad (forM)
import GlobRegex (matchesGlob)import Control.Exception (handle)
import Control.Monad (forM)
import GlobRegex (matchesGlob)

Since directories and files live in the “real world” of activities that have effects, our globbing function will have to have IO in its result type.

namesMatching :: String -> IO [FilePath]namesMatching :: String -> IO [FilePath]

If the string we're passed doesn't contain any pattern characters, all we need to do is check that the given name exists in the filesystem. (Notice that we use Haskell's function guard syntax here to write a nice tidy definition. An “if” would do, but isn't as aestheticall pleasing.)

isPattern :: String -> Bool
isPattern = any (`elem` "[*?")

namesMatching pat
  | not (isPattern pat) = do
    exists <- doesNameExist pat
    return (if exists then [pat] else [])isPattern :: String -> Bool
isPattern = any (`elem` "[*?")

namesMatching pat
  | not (isPattern pat) = do
    exists <- doesNameExist pat
    return (if exists then [pat] else [])

(You might have noticed that the function doesNameExist that we use here isn't in the list of functions we imported from System.Directory. We'll define it shortly.)

What if the string is a glob pattern?

  | otherwise = do  | otherwise = do

We use splitFileName to split the string into a pair of “everything but the final name” and “the final name”. If the first element is empty, we're looking for a pattern in the current directory.

    case splitFileName pat of
      ("", baseName) -> do
          curDir <- getCurrentDirectory
          listMatches curDir baseName    case splitFileName pat of
      ("", baseName) -> do
          curDir <- getCurrentDirectory
          listMatches curDir baseName

Otherwise, we must check the directory name and see if it contains patterns. If it does not, we create a singleton list of theb directory name. If it contains a pattern, we list all of the matching directories.

      (dirName, baseName) -> do
          dirs <- if isPattern dirName
                  then namesMatching (dropTrailingPathSeparator dirName)
                  else return [dirName]      (dirName, baseName) -> do
          dirs <- if isPattern dirName
                  then namesMatching (dropTrailingPathSeparator dirName)
                  else return [dirName]

Point out the use of return above. This will be one of the first times the reader will have encountered it. Even if we've already discussed this topic earlier, we should repeat ourselves.

[Note]Note

Using the System.FilePath module can be a litle tricky. Above is a case in point; let's use ghci to illustrate the problem. The splitFileName function leaves a trailing slash on the end of the directory name that it returns.

ghci> :module +System.FilePath
ghci> splitFileName "foo/bar"
Loading package FilePath-0.11 ... linking ... done.
("foo/","bar")
ghci> :module +System.FilePath
ghci> splitFileName "foo/bar"
Loading package FilePath-0.11 ... linking ... done.
("foo/","bar")

If we didn't remember (or know enough) to remove that slash, we'd recurse endlessly in that call to namesMatching above, because of the following behaviour of splitFileName.

ghci> splitFileName "foo/"
("foo/","")
ghci> splitFileName "foo/"
("foo/","")

You can guess why this note got to be written!

Finally, we collect all matches in every directory, giving us a list of lists, and concatenate them into a single list of names.

          let listDir = if isPattern baseName
                        then listMatches
                        else listPlain
          pathNames <- forM dirs $ \dir -> do
                           baseNames <- listDir dir baseName
                           return (map (dir </>) baseNames)
          return (concat pathNames)          let listDir = if isPattern baseName
                        then listMatches
                        else listPlain
          pathNames <- forM dirs $ \dir -> do
                           baseNames <- listDir dir baseName
                           return (map (dir </>) baseNames)
          return (concat pathNames)

The unfamiliar forM function above acts a little like a “for” loop: it maps its second argument (an action) over its first (a list), and returns the resulting list.

We have a few loose ends to clean up. The first is the definition of the doesNameExist function, used above. The System.Directory doesn't let us check to see if a name exists in the filesystem. It forces us to think about whether we want to check for the existence of a file or a directory, even if we don't care what kind of thing the name is. This ungainly API is brought upon us by Windows. We react by rolling the two checks into a single function, so that we can ignore this portability quirk. In the name of performance, we make the check for a file first, since files are far more common than directories.

doesNameExist :: FilePath -> IO Bool

doesNameExist name = do
    fileExists <- doesFileExist name
    if fileExists
      then return True
      else doesDirectoryExist namedoesNameExist :: FilePath -> IO Bool

doesNameExist name = do
    fileExists <- doesFileExist name
    if fileExists
      then return True
      else doesDirectoryExist name

We have two other functions to define, each of which returns a list of names in a directory. The listMatches function returns a list of all files matching the given glob pattern in a directory, while the listPlain function returns either an empty or singleton list, depending on whether the single name it's passed exists.

listMatches :: FilePath -> String -> IO [String]
listMatches dirName pat = do
    dirName' <- if null dirName
                then getCurrentDirectory
                else return dirName
    handle (const (return [])) $ do
        names <- getDirectoryContents dirName'
        let names' = if isHidden pat
                     then filter isHidden names
                     else filter (not . isHidden) names
        return (filter (`matchesGlob` pat) names')

listPlain :: FilePath -> String -> IO [String]
listPlain dirName baseName = do
    exists <- if null baseName
              then doesDirectoryExist dirName
              else doesNameExist (dirName </> baseName)
    return (if exists then [baseName] else [])listMatches :: FilePath -> String -> IO [String]
listMatches dirName pat = do
    dirName' <- if null dirName
                then getCurrentDirectory
                else return dirName
    handle (const (return [])) $ do
        names <- getDirectoryContents dirName'
        let names' = if isHidden pat
                     then filter isHidden names
                     else filter (not . isHidden) names
        return (filter (`matchesGlob` pat) names')

listPlain :: FilePath -> String -> IO [String]
listPlain dirName baseName = do
    exists <- if null baseName
              then doesDirectoryExist dirName
              else doesNameExist (dirName </> baseName)
    return (if exists then [baseName] else [])

If we look closely at the definition of listMatches above, we'll see a call to a function named handle. Earlier on, we imported this from the Control.Exception module; as that import implies, this gives us our first taste of exception handling in Haskell. Let's drop into ghci and see what we can find out.

ghci> :module +Control.Exception
ghci> :type handle
handle :: (Exception -> IO a) -> IO a -> IO a
ghci> :module +Control.Exception
ghci> :type handle
handle :: (Exception -> IO a) -> IO a -> IO a

This is telling us that handle takes two arguments. The first is a function that is passed an exception value, and can do I/O (we can see this because of the IO type in its return value); this is the handler to run if an exception is thrown. The second argument is the code to run that might throw an exception.

As for the exception handler, the type of the handle constrains it to return the same type of value as the body of code that threw the exception. So its choices are to either throw an exception or, as in our case, return a list of Strings.

The const function takes two arguments; it always returns its first argument, no matter what its second argument is.

ghci> :type const
const :: a -> b -> a
ghci> :type return []
return [] :: (Monad m) => m [a]
ghci> :type handle (const (return []))
handle (const (return [])) :: IO [a] -> IO [a]
ghci> :type const
const :: a -> b -> a
ghci> :type return []
return [] :: (Monad m) => m [a]
ghci> :type handle (const (return []))
handle (const (return [])) :: IO [a] -> IO [a]

We won't have anything more to say about exception handling here. There's plenty more to cover, though, so we'll be returning to the subject of exceptions in chapter FIXME.

Exercises

1.

Although we've gone to some lengths to write a portable namesMatching function, the function uses our case sensitive globToRegex function. Find a way to modify namesMatching to be case sensitive on Unix, and case insensitive on Windows, without modifying its type signature.

Hint: consider reading the documentation for System.FilePath to look for a variable that tells us whether we're running on a Unix-like system, or on Windows.

2.

If you're on a Unix-like system, look through the documentation for the System.Posix.Files module, and see if you can find a replacement for the doesNameExist function.

3.

The * wildcard only matches names within a single directory. Many shells have an extended wildcard syntax, **, that matches names recursively in all directories. For example, **.c would mean “match a name ending in .c in this directory or any subdirectory at any depth”. Implement matching on ** wildcards.

Handling errors through API design

It's not necessarily a disaster if our globToRegex is passed a malformed pattern. Perhaps a user mistyped a pattern, in which case we'd like to be able to report a meaningful error message.

Calling the error function when this kind of problem occurs can be a drastic response (exploring its consequences was the focus of exercise Q: 1). The error throws an exception. Pure Haskell code cannot deal with exceptions, so control is going to rocket out of our pure code into the nearest caller that lives in IO and has an appropriate exception handler installed. If no such handler is installed, the Haskell runtime will default to terminating our program (or print a nasty error message, in ghci).

So calling error is a little like pulling the handle of a fighter plane's ejection seat. We're bailing out of a catastrophic situation that we can't deal with gracefully, and there's likely to be a lot of flaming wreckage strewn about by the time we hit the ground.

We've established that error is for disasters, but we're still using it in globToRegex. In that case, malformed input should be rejected, but not turned into a big deal. What would be a better way to handle this?

Haskell's type system and libraries to the rescue! We can encode the possibility of failure in the type signature of globToRegex, using the predefined Either type.

type GlobError = String

globToRegex :: String -> Either GlobError Stringtype GlobError = String

globToRegex :: String -> Either GlobError String

A value returned by globToRegex will now be either Left "an error message" or Right "a valid regexp". This return type forces our callers to deal with the possibility of error. (You'll find that this use of the Either type occurs frequently in Haskell code.)

Exercises

1.

Write a version of globToRegex that uses the type signature above.

2.

Modify the type signature of namesMatching so that it encodes the possibility of a bad pattern, and make it call your rewritten globToRegex function.

Putting our code to work

The namesMatching function isn't very exciting by itself, but it's a useful building block. Combine it with a few more functions, and we can start to do interesting things.

Here's one such example. Let's define a renameWith function that, instead of simply renaming a file, applies a function to the file's name, and renames the file to whatever that function returns.

import System.FilePath (replaceExtension)
import System.Directory (doesFileExist, renameDirectory, renameFile)
import Glob (namesMatching)

renameWith :: (FilePath -> FilePath)
           -> FilePath
           -> IO FilePath

renameWith f path = do
    let path' = f path
    rename path path'
    return path'import System.FilePath (replaceExtension)
import System.Directory (doesFileExist, renameDirectory, renameFile)
import Glob (namesMatching)

renameWith :: (FilePath -> FilePath)
           -> FilePath
           -> IO FilePath

renameWith f path = do
    let path' = f path
    rename path path'
    return path'

Once again, we work around the ungainly file/directory split that portability has forced upon System.Directory with a helper function:

rename :: FilePath -> FilePath -> IO ()

rename old new = do
    isFile <- doesFileExist old
    let f = if isFile then renameFile else renameDirectory
    f old newrename :: FilePath -> FilePath -> IO ()

rename old new = do
    isFile <- doesFileExist old
    let f = if isFile then renameFile else renameDirectory
    f old new

The System.FilePath module provides many useful functions for manipulating file names. These functions mesh nicely with our renameWith and namesMatchingfunctions, so that we can quickly use them to create functions with complex behaviour. As an example, this terse function changes the file name suffixing convention for C++ source files.

cc2cpp name =
    mapM (renameWith (flip replaceExtension ".cpp")) =<< namesMatching "*.cc"cc2cpp name =
    mapM (renameWith (flip replaceExtension ".cpp")) =<< namesMatching "*.cc"

The cc2cpp function uses a few functions we'll be seeing over and over. The mapM function maps a function that can do I/O over a list. The flip function takes another function as argument, and swaps the order of its arguments (inspect the type of replaceExtension in ghci to see why). The =<< function feeds the result of its right hand side as an argument to its left.

Exercises

1.

Glob patterns are simple enough to interpret that it's easy to write a matcher directly in Haskell, rather than going through the regexp machinery. Give it a try.

Wrapping it up

Write some kind of intelligible conclusion.

Want to stay up to date? Subscribe to the comment feed for this chapter, or the entire book.

Copyright 2007 Bryan O'Sullivan, Don Stewart, and John Goerzen. This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 License. Icons by Paul Davey aka Mattahan.