Table of Contents
Working with text strings is a fundamental tool in the programmer's toolbox. We've already been introduced to bytestrings and ropes as efficient ways to work with large text strings. Throughout this book, we'll be returning to string manipulation over and again, to show off the breadth of different techniques and libraries we can use to tackle a variety of problems that happen to involve working with strings.
In this chapter, we'll develop a library that can match file names against “glob-style” file name patterns. These patterns are a common feature of command shells. To begin, we'll introduce the pattern language that we'll be working with. Our library will translate these patterns into regular expressions, so we'll need to understand how to use regular expressions in Haskell. Next, we'll talk about how to use Haskell's standard directory listing functions with this library. Along the way, we'll talk a little about writing portable code, good API design, and constructing code from small building blocks.
Many “system-oriented” programming languages
provide library routines that let us match a file name against a
pattern, or that will give a list of files that match the
pattern. (In other languages, this function is often named
fnmatch
.) Although Haskell's standard
library generally has good “system programming”
facilities, it doesn't provide these kinds of pattern matching
functions. We'll take this as an opportunity to develop our
own.
The kinds of patterns we'll be dealing with are commonly referred to as “glob patterns” (the term we'll use), “wildcard patterns”, or “shell-style patterns”. These patterns have just a few simple rules. You probably already know them, but we'll quickly recap here.
Matching a string against a pattern starts at the beginning of the string, and finishes at the end.
Most literal characters match themselves. For example,
the text foo
in a pattern will match
foo
, and only foo
, in
an input string.
The *
(asterisk) character means
“match anything”; it will match any text,
including the empty string.
The ?
(question mark) character means
“match any single character”.
A [
(open square bracket) character
begins a character class, which is
ended by a ]
. Its meaning is
“match any character in this class”. A
character class can be negated by
following the opening [
with a
!
, so that it means “match any
character not in this
class”.
As a shorthand, a character followed by a
-
(dash), followed by another character,
denotes a range: “match any
character within this set”.
Character classes have an added subtlety; they can't be
empty. The first character after the opening
[
or [!
is part of the
class, so we can write a class containing the
]
character as
[]aeiou]
.
While Haskell doesn't provide a way to match glob patterns among its standard libraries, it has excellent regular expression matching facilities. Glob patterns are nothing more than cut-down regular expressions with slightly different syntax. It's easy to convert glob patterns into regular expressions, so that's what we'll do. But in order to do so, we must first understand how to use regular expressions in Haskell.
In this section, I'll be assuming that you are already familiar with regular expressions (which I'll abbreviate as regexps from here on) by way of some other language, such as Python, Perl, or Java. Rather than introduce them as something new, I'll be focusing on what's different about regexp handling in Haskell, compared to other languages. Haskell's regular expression matching libraries are a lot more expressive than those of other languages, so there's plenty to talk about.
To begin our exploration of the regexp libraries, the only
module we'll need to work with is Text.Regex.Posix
. As usual, the
most convenient way to explore this module is by interacting
with it via ghci.
ghci> :module +Text.Regex.Posixghci>
:module +Text.Regex.Posix
The only function that we're likely to need for normal use
is the regexp matching function, an infix operator named
=~
(borrowed from Perl. The first hurdle
to overcome is that Haskell's regexp libraries make heavy use of
polymorphism. As a result, the type signature of the
=~
operator is a bit difficult to
understand. Let's drop into ghci and see
what it looks like.
ghci> :type (=~) (=~) :: (Text.Regex.Base.RegexLike.RegexContext Regex source1 target, Text.Regex.Base.RegexLike.RegexMaker Regex CompOption ExecOption source) => source1 -> source -> targetghci>
:type (=~)
(=~) :: (Text.Regex.Base.RegexLike.RegexContext Regex source1 target, Text.Regex.Base.RegexLike.RegexMaker Regex CompOption ExecOption source) => source1 -> source -> target
Ouch! Rather than pick apart what this convoluted
definition means, let's take a shortcut. The
=~
operator uses typeclasses for both of
its arguments, and also for its return type. The first argument
(on the left of the =~
) is the text to
match; the second (on the right) is the regular expression to
match against. We can pass either a String or a
Data.ByteString as either argument.
The =~
operator is polymorphic in its
return type, so the Haskell compiler needs some way to know
what type of result we would like. In real code, it may be
able to infer the right type, due to the way we subsequently
use the result. But such cues are often lacking when we're
exploring with ghci. If we omit a specific
type for the result, we'll get an intimidating-looking error
message. Here's what ghci tells us if we
try to do a regexp match but omit the return type.
ghci> "my left foot" =~ "foo" <interactive>:1:0: No instance for (Text.Regex.Base.RegexLike.RegexContext Regex [Char] target) arising from use of `=~' at <interactive>:1:0-22 Possible fix: add an instance declaration for (Text.Regex.Base.RegexLike.RegexContext Regex [Char] target) In the expression: "my left foot" =~ "foo" In the definition of `it': it = "my left foot" =~ "foo"ghci>
"my left foot" =~ "foo"
<interactive>:1:0: No instance for (Text.Regex.Base.RegexLike.RegexContext Regex [Char] target) arising from use of `=~' at <interactive>:1:0-22 Possible fix: add an instance declaration for (Text.Regex.Base.RegexLike.RegexContext Regex [Char] target) In the expression: "my left foot" =~ "foo" In the definition of `it': it = "my left foot" =~ "foo"
That's quite an impenetrable error message, but what it
means is that ghci can't infer what type of
result to give back to us. (This is the same kind of error
that a compiler would print if it couldn't infer the correct
type in a real program.) Here's the key to dealing with this
error. When ghci can't infer the target
type, we tell it what
we'd like the type to be. If we want an expression of type
Bool, we'll get a pass/fail result.
ghci> "my left foot" =~ "foo" :: Bool Loading package regex-base-0.71 ... linking ... done. Loading package regex-posix-0.71 ... linking ... done. True ghci> "your right hand" =~ "bar" :: Bool False ghci> "your right hand" =~ "(hand|foot)" :: Bool Trueghci>
"my left foot" =~ "foo" :: Bool
Loading package regex-base-0.71 ... linking ... done. Loading package regex-posix-0.71 ... linking ... done. Trueghci>
"your right hand" =~ "bar" :: Bool
Falseghci>
"your right hand" =~ "(hand|foot)" :: Bool
True
In the bowels of the base regexp library, there's a
typeclass named RegexContext
that
describes how a target
type should behave; the base library defines many instances of
this typeclass for us. The Bool type is an
instance of this typeclass, so we get back a usable result.
Another such instance is Int, which gives us a
count of the number of times the regexp matches.
ghci> "a star called henry" =~ "planet" :: Int 0 ghci> "honorificabilitudinitatibus" =~ "[aeiou]" :: Int 13ghci>
"a star called henry" =~ "planet" :: Int
0ghci>
"honorificabilitudinitatibus" =~ "[aeiou]" :: Int
13
If we ask for a String result, we'll get the first substring that matches, or an empty string if nothing matches.
ghci> "I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: String "ii" ghci> "hi ludi, F. Baconis nati, tuiti orbi" =~ "Shakespeare" :: String ""ghci>
"I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: String
"ii"ghci>
"hi ludi, F. Baconis nati, tuiti orbi" =~ "Shakespeare" :: String
""
Another valid type of result is [String], which returns a list of all matching strings.
ghci> "I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: [String] ["ii","uu"] ghci> "hi ludi, F. Baconis nati, tuiti orbi" =~ "Shakespeare" :: [String] []ghci>
"I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: [String]
["ii","uu"]ghci>
"hi ludi, F. Baconis nati, tuiti orbi" =~ "Shakespeare" :: [String]
[]
That's about it for “simple” result types, but we're not by any means finished. Before we continue, let's use a single pattern for our remaining examples. We can define this pattern as a variable in ghci, to save a little typing.
ghci> let pat = "(foo[a-z]*bar|quux)"ghci>
let pat = "(foo[a-z]*bar|quux)"
We can obtain quite a lot of information about the context in which a match occurs. If we ask for a three-element tuple, we'll get back the text before the first match, the text of that match, and the text that follows it.
ghci> "before foodiebar after" =~ pat :: (String,String,String) ("before ","foodiebar"," after")ghci>
"before foodiebar after" =~ pat :: (String,String,String)
("before ","foodiebar"," after")
If the match fails, the entire text is returned as the “before” element of the tuple, with the other two elements left empty.
ghci> "no match here" =~ pat :: (String,String,String) ("no match here","","")ghci>
"no match here" =~ pat :: (String,String,String)
("no match here","","")
Asking for a four-element tuple gives us a fourth element that's a list of all groups in the pattern that matched.
ghci> "before foodiebar after" =~ pat :: (String,String,String,[String]) ("before ","foodiebar"," after",["foodiebar"])ghci>
"before foodiebar after" =~ pat :: (String,String,String,[String])
("before ","foodiebar"," after",["foodiebar"])
We can get numeric information about matches, too. A pair of Ints gives us the starting offset of the first match, and its length. If we ask for a list of these pairs, we'll get this information for all matches.
ghci> "before foodiebar after" =~ pat :: (Int,Int) (7,9) ghci> "i foobarbar a quux" =~ pat :: [(Int,Int)] [(2,9),(14,4)]ghci>
"before foodiebar after" =~ pat :: (Int,Int)
(7,9)ghci>
"i foobarbar a quux" =~ pat :: [(Int,Int)]
[(2,9),(14,4)]
A failed match is represented by the value
-1
as the first element of the tuple (the
match offset) if we've asked for a single tuple, or an empty
list if we've asked for a list of tuples.
ghci> "eleemosynary" =~ pat :: (Int,Int) (-1,0) ghci> "mondegreen" =~ pat :: [(Int,Int)] []ghci>
"eleemosynary" =~ pat :: (Int,Int)
(-1,0)ghci>
"mondegreen" =~ pat :: [(Int,Int)]
[]
Believe it or not, this is not a
comprehensive list of built-in instances of the
RegexContext
typeclass. For a complete
list, see the documentation for the Text.Regex.Base.Context
module.
As we noted earlier, the =~
operator
uses typeclasses for its argument types and its return type.
Any combination of String and
ByteString will work for the types of both the
regular expression and the text to match against. Recall,
from our coverage of bytestrings in chapter
FIXME, that the pack
function takes a String and returns its
corresponding ByteString.
ghci> :module +Data.ByteString.Char8 ghci> :type pack "foo" pack "foo" :: ByteStringghci>
:module +Data.ByteString.Char8
ghci>
:type pack "foo"
pack "foo" :: ByteString
We can then try using different combinations of String and ByteString.
ghci> pack "foo" =~ "bar" :: Bool False ghci> "foo" =~ pack "bar" :: Int 0 ghci> pack "foo" =~ pack "o" :: [(Int, Int)] [(1,1),(2,1)]ghci>
pack "foo" =~ "bar" :: Bool
Falseghci>
"foo" =~ pack "bar" :: Int
0ghci>
pack "foo" =~ pack "o" :: [(Int, Int)]
[(1,1),(2,1)]
However, we need to be aware that if we want a string value in the result of a match, the text we're matching against must be the same type of string. Let's see what this means in practice.
ghci> pack "good food" =~ ".ood" :: [ByteString] ["good","food"]ghci>
pack "good food" =~ ".ood" :: [ByteString]
["good","food"]
In the above example, we've used the
pack
to turn a String into a
ByteString. The typechecker accepts this because
ByteString appears in the result type. But if we
try getting a String out, that
won't work.
ghci> "good food" =~ ".ood" :: [ByteString] <interactive>:1:0: No instance for (Text.Regex.Base.RegexLike.RegexContext Regex [Char] [ByteString]) arising from use of `=~' at <interactive>:1:0-20 Possible fix: add an instance declaration for (Text.Regex.Base.RegexLike.RegexContext Regex [Char] [ByteString]) In the expression: "good food" =~ ".ood" In the expression: "good food" =~ ".ood" :: [ByteString] In the definition of `it': it = "good food" =~ ".ood" :: [ByteString]ghci>
"good food" =~ ".ood" :: [ByteString]
<interactive>:1:0: No instance for (Text.Regex.Base.RegexLike.RegexContext Regex [Char] [ByteString]) arising from use of `=~' at <interactive>:1:0-20 Possible fix: add an instance declaration for (Text.Regex.Base.RegexLike.RegexContext Regex [Char] [ByteString]) In the expression: "good food" =~ ".ood" In the expression: "good food" =~ ".ood" :: [ByteString] In the definition of `it': it = "good food" =~ ".ood" :: [ByteString]
And ouch! That's quite an error message. Don't let its verbosity confound you; we can easily fix this problem by making the string types of the left hand side and the result match once again.
ghci> "good food" =~ ".ood" :: [String] ["good","food"]ghci>
"good food" =~ ".ood" :: [String]
["good","food"]
This restriction does not apply to the type of the regexp we're matching against. It can be either a String or ByteString, unconstrained by the other types in use.
When you look through Haskell library documentation,
you'll see several regexp-related modules. The modules under
Text.Regex.Base
define
the common API adhered to by all of the other regexp modules.
It's possible to have multiple implementations of the regexp
API installed at one time. At the time of writing,
GHC is bundled with one
implementation, Text.Regex.Posix
. As its name
suggests, this package provides POSIX regexp semantics.
Other Haskell regexp packages are available for download on the Internet. Some provide faster engines than the current POSIX engine; others provide Perl-style matching capabilities. All follow the standard API that we have covered in this section.
Now that we've seen the myriad of ways to match text against regular expressions, let's turn our attention back to glob patterns. We want to write a function that will take a glob pattern and return its representation as a regular expression. Both glob patterns and regexps are text strings, so the type that our function ought to have seems clear.
globToRegex :: String -> StringglobToRegex :: String -> String
We start our definition of the
globToRegex
function by recalling that a
text string must match a glob pattern. Before we attempt to
convert any part of the glob pattern, we need to have a
“rooted” regular expression.
globToRegex cs = '^' : globToRegex' csglobToRegex cs = '^' : globToRegex' cs
Recall that the String is just a synonym for
[Char], a list of Chars. The
:
operator puts a value (the
^
character in this case) onto the front of a
list, where the list is the value returned by the yet-to-be-seen
globToRegex'
function.
That single-quote character in the name
globToRegex'
is not a typo, by the way;
Haskell is unusual in allowing single quotes as parts of
identifiers. A single quote is most likely to appear at the end
of a name, where it's often pronounced “prime”. (We
could, if we wanted to, name a function
head'n'tail'n'such
, but that's pretty
unusual.)
With the regular expression rooted, the
globToRegex'
function will do the bulk of
the translation work. We'll use the convenience of Haskell's
pattern matching to enumerate each of the cases we'll need to
cover.
globToRegex' :: String -> String globToRegex' "" = "$" globToRegex' ('*':cs) = ".*" ++ globToRegex' csglobToRegex' :: String -> String globToRegex' "" = "$" globToRegex' ('*':cs) = ".*" ++ globToRegex' cs
We now have a very minimal glob translator. Our first
clause stipulates that if we hit the end of our glob pattern (by
which time we'll be looking at the empty string), we return
$
, the regular expression symbol for
“match end-of-line”. The second gets us to
substitute the string .*
every time we see a
*
in our input list. And the third passes
every other character through, unaltered.
We can immediately save our new source file (let's call it
GlobRegexTiny.hs
and start playing with it
in ghci. This is a great way to do
exploratory programming: write a little code, load it into the
interpreter, and see what happens.
ghci> :load glob-to-regexp/GlobRegexTiny.hs [1 of 1] Compiling Main ( glob-to-regexp/GlobRegexTiny.hs, interpreted ) Ok, modules loaded: Main. ghci> globToRegex "*" "^.*$" ghci> globToRegex "" "^$" ghci> globToRegex "**" "^.*.*$"ghci>
:load glob-to-regexp/GlobRegexTiny.hs
[1 of 1] Compiling Main ( glob-to-regexp/GlobRegexTiny.hs, interpreted ) Ok, modules loaded: Main.ghci>
globToRegex "*"
"^.*$"ghci>
globToRegex ""
"^$"ghci>
globToRegex "**"
"^.*.*$"
These few lines of interaction might look trivial, but we have received immediate feedback on two fronts: our code passes the daunting scrutiny of Haskell's typechecker, and it produces sensible-looking results. Monkeying around with code in the interpreter early and often is consistently worthwhile.
The remaining cases for a complete definition of
globToRegex'
are easy to enumerate. First,
we get ?
out of the way.
globToRegex' ('?':cs) = '.' : globToRegex' csglobToRegex' ('?':cs) = '.' : globToRegex' cs
Should we revisit the different types of the
(++)
and (:)
operators
here? Newcomers often get confused over which to use.
More interesting is how we handle character classes.
globToRegex' ('[':'!':c:cs) = "[^" ++ c : charClass cs globToRegex' ('[':c:cs) = '[' : c : charClass cs globToRegex' ('[':_) = error "unterminated character class"globToRegex' ('[':'!':c:cs) = "[^" ++ c : charClass cs globToRegex' ('[':c:cs) = '[' : c : charClass cs globToRegex' ('[':_) = error "unterminated character class"
We take advantage of two behaviours of Haskell's pattern matcher here. The first is that it will match patterns in the order in which we declare them, so we place the most specific patterns first, and the least specific last. Secondly, there's no limit to the “depth” of a pattern: we can “peek” forwards into a list as far as we need to. (This isn't limited to simple lists, either. We can use the same capability to look inside nested structures.)
Here, the first pattern matches on three consecutive items of its input to ensure that a “negative” character class cannot be empty. The second pattern matches on only two items, so ensure that a “positive” character class cannot be empty. The final clause can only match if it's given a syntactically invalid character class.
Finally, we may need to escape some characters before we return them.
globToRegex' (c:cs) = escape c ++ globToRegex' csglobToRegex' (c:cs) = escape c ++ globToRegex' cs
The escape
function ensures that the
regexp engine will not interpret certain characters as pieces of
regular expression syntax.
escape :: Char -> String escape c | c `elem` regexChars = '\\' : [c] | otherwise = [c] where regexChars = "\\+()^$.{}]|"escape :: Char -> String escape c | c `elem` regexChars = '\\' : [c] | otherwise = [c] where regexChars = "\\+()^$.{}]|"
The charClass
helper function does
nothing more than ensure that a character class is correctly
terminated. It passes its input through unmolested until it
hits a ]
, when it hands control back to
globToRegex'
.
charClass :: String -> String charClass (']':cs) = ']' : globToRegex' cs charClass (c:cs) = c: charClass cs charClass [] = error "unterminated character class"charClass :: String -> String charClass (']':cs) = ']' : globToRegex' cs charClass (c:cs) = c: charClass cs charClass [] = error "unterminated character class"
Now that we've finished defining
globToRegex
and its helpers, let's load it
into ghci and try it out.
ghci> :load glob-to-regexp/GlobRegex.hs [1 of 1] Compiling GlobRegex ( glob-to-regexp/GlobRegex.hs, interpreted ) Ok, modules loaded: GlobRegex. ghci> :module +Text.Regex.Posix ghci> globToRegex "f??.c" Loading package regex-base-0.71 ... linking ... done. Loading package regex-posix-0.71 ... linking ... done. "^f..\\.c$"ghci>
:load glob-to-regexp/GlobRegex.hs
[1 of 1] Compiling GlobRegex ( glob-to-regexp/GlobRegex.hs, interpreted ) Ok, modules loaded: GlobRegex.ghci>
:module +Text.Regex.Posix
ghci>
globToRegex "f??.c"
Loading package regex-base-0.71 ... linking ... done. Loading package regex-posix-0.71 ... linking ... done. "^f..\\.c$"
Sure enough, that looks like a reasonable regexp. Can we use it to match against a string?
ghci> "foo.c" =~ globToRegex "f??.c" :: Bool True ghci> "test.c" =~ globToRegex "t[ea]s*" :: Bool True ghci> "taste.txt" =~ globToRegex "t[ea]s*" :: Bool Trueghci>
"foo.c" =~ globToRegex "f??.c" :: Bool
Trueghci>
"test.c" =~ globToRegex "t[ea]s*" :: Bool
Trueghci>
"taste.txt" =~ globToRegex "t[ea]s*" :: Bool
True
It works! Now let's play around a little with ghci. We
can create a temporary definition for
fnmatch
and try it out. (This is another
example of how ghci is great for exploratory programming.
We're not limited to defining temporary variables; we can also
introduce temporary functions when we need to.)
ghci> let fnmatch pat name = name =~ globToRegex pat :: Bool ghci> :type fnmatch fnmatch :: (Text.Regex.Base.RegexLike.RegexContext Regex source1 Bool) => String -> source1 -> Bool ghci> fnmatch "myname" "d*" Falseghci>
let fnmatch pat name = name =~ globToRegex pat :: Bool
ghci>
:type fnmatch
fnmatch :: (Text.Regex.Base.RegexLike.RegexContext Regex source1 Bool) => String -> source1 -> Boolghci>
fnmatch "myname" "d*"
False
The name “fnmatch
” doesn't
really have the “Haskell nature”, though. The
typical Haskell style is for functions to have descriptive,
“camel cased” names. Camel casing name smashes
words together, capitalising each word in the resulting name;
the name comes from the “humps” introduced by the
capital letters. In our library, we'll give this function the
name matchesGlob
.
matchesGlob :: FilePath -> String -> Bool name `matchesGlob` pat = name =~ globToRegex patmatchesGlob :: FilePath -> String -> Bool name `matchesGlob` pat = name =~ globToRegex pat
In an imperative language, the
globToRegex'
function is one that we'd
usually express as a loop. For example, Python's standard
fnmatch module includes a function named
translate
that does exactly the same job as
our globToRegex
function. It's written as
a loop.
If you've been exposed to functional programming through a language such as Scheme or ML, you've probably had drilled into your head the notion that “the way to emulate a loop is via tail recursion”. A function usually needs a little local scratch storage when it executes. A tail recursive function must either return a plain value, or make a recursive call as the last thing it does (these kinds of call are called “tail calls”). Since a function making a tail call can by definition never use any of its local scratch storage again, a language implementation can reuse that space when it makes the tail call. This means that a series of tail calls can execute in constant space, just as we take for granted with a loop.
Looking at the globToRegex'
function,
we can see that it is not tail recursive.
To see why, let's examine its final clause again (several of its
other clauses are structured similarly).
globToRegex' (c:cs) = escape c ++ globToRegex' csglobToRegex' (c:cs) = escape c ++ globToRegex' cs
It calls itself recursively, but this
isn't the last thing the function does.
The result of the recursive call is a parameter to the
(++)
function.
So what's going on here? Why are we not writing a tail recursive definition for this function? The answer lies with Haskell's non-strict evaluation strategy. Before we start talking about that, let's quickly talk about why, in a traditional language, we'd be trying to avoid this kind of recursive definition.
Returning to the clause above,
globToRegex'
computes a little bit of its
result, then calls itself recursively, then returns the complete
result. In a traditional language, each recursive call is going
to require the allocation of temporary scratch memory until it
returns.
Image: stack allocation for a recursive call to
globToRegex'
.
Compound these scratch allocations over a large number of recursive calls, and the amount of space we use while processing grows linearly with the size of the list we must process.
Image: stack allocation for a pile of recursive calls.
For a problem like glob-to-regexp conversion, where we'll always be dealing with very small amounts of data, this overhead is insignificant. But if we had a hundred million elements to process, and used plain recursion rather than tail recursion or a loop to process them, we'd have a severe problem.
Haskell neatly defangs this problem by deferring the
evaluation of an expression until its result is needed. To
understand how this deferred evaluation works, let's walk
through what happens when the final clause of the
globToRegex'
is called. (As you read, bear
in mind that this is a simplified description.)
In order to determine that this clause must be called, the
pattern matcher must inspect a little bit of the list it was
passed. The pattern requires it to do nothing more than
determine whether it has been given an empty list (in which case
the clause will not be evaluated), or a non-empty list (the case
we're interested in). Once the pattern matcher has established
that the list isn't empty, it goes no deeper. It doesn't look
inside c
, the head of the list. It doesn't
follow cs
, the tail.
We have now decided to evaluate the right hand side of
the clause. The first expression we must evaluate is the call
to the list append function, (++)
.
(Because we don't need the results of escape c
or
globToRegex' cs
yet, we suspend the evaluation of
those functions, to evaluate when we'll need them.) So let's
remind ourselves what (++)
looks
like:
(++) :: [a] -> [a] -> [a] [] ++ ys = ys (x:xs) ++ ys = x : (xs ++ ys)(++) :: [a] -> [a] -> [a] [] ++ ys = ys (x:xs) ++ ys = x : (xs ++ ys)
As the argument to (++)
is not an
empty list, the second clause matches, so we must evaluate
x
. This requires that we evaluate
escape c
, which we had earlier deferred. Let's
look again at the definition of
escape
.
escape :: Char -> String escape c | c `elem` regexChars = '\\' : [c] | otherwise = [c] where regexChars = "\\+()^$.{}]|"escape :: Char -> String escape c | c `elem` regexChars = '\\' : [c] | otherwise = [c] where regexChars = "\\+()^$.{}]|"
This will return a list whose first element is either the
\
character or the character it was passed.
Once we've constructed the first element of this list, we'll
defer constructing the rest of it until our caller needs it. We
suspend the evaluation of the rest of
escape
so that we can pass this datum back
to our caller. Haskell implementations do this by saving away
enough information to resume the evaluation later; this saved
package of information is called a “thunk”.
Not only is the evaluation of escape
suspended; so too are the evaluation of its callers,
(++)
and globToRegex'
,
and so on up the call chain until some expression needs to
inspect the value.
What would happen if the caller of
globToRegex'
was for some reason the
function null
? null
only cares whether its argument is an empty list: once it finds
this out, it's never going to look at the remainder of the
result of globToRegex'
. None of the work
we deferred in order to get a partial result back to
null
will ever actually occur.
Remarkably, given this kind of evaluation strategy, our
recursive definition of globToRegex'
will
execute in constant space (more or less). To see why, look at
the definition of (++)
; although it's
simpler, it has a similar recursive structure.
If we put on our implementor's hats and think about how we
might suspend the evaluation of (++)
in a
thunk, all we need to store are the current heads of the left
and right lists; the result of the next evaluation step depends
on nothing more. On each resumption of the thunk, we could
simply peel off the head of the left list and update the thunk
with the new head, repeating until we reach its end. Therefore,
we can execute in constant space.
It's all very well to have a function that can match glob
patterns, but we'd like to be able to put this to practical use.
On Unix-like systems, the glob
function
returns the names of all files and directories that match a
given glob pattern. Let's build a similar function in Haskell.
Following the Haskell norm of descriptive naming, we'll call our
function namesMatching
.
module Glob (namesMatching) wheremodule Glob (namesMatching) where
This function will obviously have to manipulate filesystem paths a lot, splicing and joining them as it goes. We'll need to use a few previously unfamiliar modules along the way.
The System.Directory
module provides standard functions for working with directories
and their contents.
import System.Directory (doesDirectoryExist, doesFileExist, getCurrentDirectory, getDirectoryContents)import System.Directory (doesDirectoryExist, doesFileExist, getCurrentDirectory, getDirectoryContents)
The System.FilePath
module abstracts the details of an operating system's path name
conventions. Using this together with the System.Directory
module, we can
write a portable namesMatching
function
that will run on both Unix-like and Windows systems.
import System.FilePath (dropTrailingPathSeparator, splitFileName, (</>))import System.FilePath (dropTrailingPathSeparator, splitFileName, (</>))
Finally, we'll be emulating a “for” loop;
getting our first taste of exception handline in Haskell; and of
course using the matchesGlob
function we
just wrote.
import Control.Exception (handle) import Control.Monad (forM) import GlobRegex (matchesGlob)import Control.Exception (handle) import Control.Monad (forM) import GlobRegex (matchesGlob)
Since directories and files live in the “real
world” of activities that have effects, our globbing
function will have to have IO
in its
result type.
namesMatching :: String -> IO [FilePath]namesMatching :: String -> IO [FilePath]
If the string we're passed doesn't contain any pattern characters, all we need to do is check that the given name exists in the filesystem. (Notice that we use Haskell's function guard syntax here to write a nice tidy definition. An “if” would do, but isn't as aestheticall pleasing.)
isPattern :: String -> Bool isPattern = any (`elem` "[*?") namesMatching pat | not (isPattern pat) = do exists <- doesNameExist pat return (if exists then [pat] else [])isPattern :: String -> Bool isPattern = any (`elem` "[*?") namesMatching pat | not (isPattern pat) = do exists <- doesNameExist pat return (if exists then [pat] else [])
(You might have noticed that the function
doesNameExist
that we use here isn't in the
list of functions we imported from System.Directory
. We'll define it
shortly.)
What if the string is a glob pattern?
| otherwise = do | otherwise = do
We use splitFileName
to split the
string into a pair of “everything but the final
name” and “the final name”. If the first
element is empty, we're looking for a pattern in the current
directory.
case splitFileName pat of ("", baseName) -> do curDir <- getCurrentDirectory listMatches curDir baseName case splitFileName pat of ("", baseName) -> do curDir <- getCurrentDirectory listMatches curDir baseName
Otherwise, we must check the directory name and see if it contains patterns. If it does not, we create a singleton list of theb directory name. If it contains a pattern, we list all of the matching directories.
(dirName, baseName) -> do dirs <- if isPattern dirName then namesMatching (dropTrailingPathSeparator dirName) else return [dirName] (dirName, baseName) -> do dirs <- if isPattern dirName then namesMatching (dropTrailingPathSeparator dirName) else return [dirName]
Point out the use of return
above.
This will be one of the first times the reader will have
encountered it. Even if we've already discussed this topic
earlier, we should repeat ourselves.
Finally, we collect all matches in every directory, giving us a list of lists, and concatenate them into a single list of names.
let listDir = if isPattern baseName then listMatches else listPlain pathNames <- forM dirs $ \dir -> do baseNames <- listDir dir baseName return (map (dir </>) baseNames) return (concat pathNames) let listDir = if isPattern baseName then listMatches else listPlain pathNames <- forM dirs $ \dir -> do baseNames <- listDir dir baseName return (map (dir </>) baseNames) return (concat pathNames)
The unfamiliar forM
function above acts
a little like a “for” loop: it maps its second
argument (an action) over its first (a list), and returns the
resulting list.
We have a few loose ends to clean up. The first is the
definition of the doesNameExist
function,
used above. The System.Directory
doesn't let us
check to see if a name exists in the filesystem. It forces us
to think about whether we want to check for the existence of a
file or a directory, even if we don't care what kind of thing
the name is. This ungainly API is brought upon us by Windows.
We react by rolling the two checks into a single function, so
that we can ignore this portability quirk. In the name of
performance, we make the check for a file first, since files are
far more common than directories.
doesNameExist :: FilePath -> IO Bool doesNameExist name = do fileExists <- doesFileExist name if fileExists then return True else doesDirectoryExist namedoesNameExist :: FilePath -> IO Bool doesNameExist name = do fileExists <- doesFileExist name if fileExists then return True else doesDirectoryExist name
We have two other functions to define, each of which returns
a list of names in a directory. The
listMatches
function returns a list of all
files matching the given glob pattern in a directory, while the
listPlain
function returns either an empty
or singleton list, depending on whether the single name it's
passed exists.
listMatches :: FilePath -> String -> IO [String] listMatches dirName pat = do dirName' <- if null dirName then getCurrentDirectory else return dirName handle (const (return [])) $ do names <- getDirectoryContents dirName' let names' = if isHidden pat then filter isHidden names else filter (not . isHidden) names return (filter (`matchesGlob` pat) names') listPlain :: FilePath -> String -> IO [String] listPlain dirName baseName = do exists <- if null baseName then doesDirectoryExist dirName else doesNameExist (dirName </> baseName) return (if exists then [baseName] else [])listMatches :: FilePath -> String -> IO [String] listMatches dirName pat = do dirName' <- if null dirName then getCurrentDirectory else return dirName handle (const (return [])) $ do names <- getDirectoryContents dirName' let names' = if isHidden pat then filter isHidden names else filter (not . isHidden) names return (filter (`matchesGlob` pat) names') listPlain :: FilePath -> String -> IO [String] listPlain dirName baseName = do exists <- if null baseName then doesDirectoryExist dirName else doesNameExist (dirName </> baseName) return (if exists then [baseName] else [])
If we look closely at the definition of
listMatches
above, we'll see a call to a
function named handle
. Earlier on, we
imported this from the Control.Exception
module; as that
import implies, this gives us our first taste of exception
handling in Haskell. Let's drop into ghci and see what we can
find out.
ghci> :module +Control.Exception ghci> :type handle handle :: (Exception -> IO a) -> IO a -> IO aghci>
:module +Control.Exception
ghci>
:type handle
handle :: (Exception -> IO a) -> IO a -> IO a
This is telling us that handle
takes
two arguments. The first is a function that is passed an
exception value, and can do I/O (we can see this because of the
IO type in its return value); this is the handler
to run if an exception is thrown. The second argument is the
code to run that might throw an exception.
As for the exception handler, the type of the
handle
constrains it to return the same
type of value as the body of code that threw the exception. So
its choices are to either throw an exception or, as in our case,
return a list of Strings.
The const
function takes two arguments;
it always returns its first argument, no matter what its second
argument is.
ghci> :type const const :: a -> b -> a ghci> :type return [] return [] :: (Monad m) => m [a] ghci> :type handle (const (return [])) handle (const (return [])) :: IO [a] -> IO [a]ghci>
:type const
const :: a -> b -> aghci>
:type return []
return [] :: (Monad m) => m [a]ghci>
:type handle (const (return []))
handle (const (return [])) :: IO [a] -> IO [a]
We won't have anything more to say about exception handling here. There's plenty more to cover, though, so we'll be returning to the subject of exceptions in chapter FIXME.
It's not necessarily a disaster if our
globToRegex
is passed a malformed pattern.
Perhaps a user mistyped a pattern, in which case we'd like to be
able to report a meaningful error message.
Calling the error
function when this
kind of problem occurs can be a drastic response (exploring its
consequences was the focus of exercise Q: 1). The error
throws an exception. Pure Haskell code cannot deal with
exceptions, so control is going to rocket out of our pure code
into the nearest caller that lives in IO
and has an appropriate exception handler installed. If no such
handler is installed, the Haskell runtime will default to
terminating our program (or print a nasty error message, in
ghci).
So calling error
is a little like
pulling the handle of a fighter plane's ejection seat. We're
bailing out of a catastrophic situation that we can't deal with
gracefully, and there's likely to be a lot of flaming wreckage
strewn about by the time we hit the ground.
We've established that error
is for
disasters, but we're still using it in
globToRegex
. In that case, malformed input
should be rejected, but not turned into a big deal. What would
be a better way to handle this?
Haskell's type system and libraries to the rescue! We can
encode the possibility of failure in the type signature of
globToRegex
, using the predefined
Either type.
type GlobError = String globToRegex :: String -> Either GlobError Stringtype GlobError = String globToRegex :: String -> Either GlobError String
A value returned by globToRegex
will
now be either Left "an error message"
or
Right "a valid regexp"
. This return type
forces our callers to deal with the possibility of error.
(You'll find that this use of the Either type
occurs frequently in Haskell code.)
The namesMatching
function isn't very
exciting by itself, but it's a useful building block. Combine
it with a few more functions, and we can start to do interesting
things.
Here's one such example. Let's define a
renameWith
function that, instead of simply
renaming a file, applies a function to the file's name, and
renames the file to whatever that function returns.
import System.FilePath (replaceExtension) import System.Directory (doesFileExist, renameDirectory, renameFile) import Glob (namesMatching) renameWith :: (FilePath -> FilePath) -> FilePath -> IO FilePath renameWith f path = do let path' = f path rename path path' return path'import System.FilePath (replaceExtension) import System.Directory (doesFileExist, renameDirectory, renameFile) import Glob (namesMatching) renameWith :: (FilePath -> FilePath) -> FilePath -> IO FilePath renameWith f path = do let path' = f path rename path path' return path'
Once again, we work around the ungainly file/directory split
that portability has forced upon System.Directory
with a helper
function:
rename :: FilePath -> FilePath -> IO () rename old new = do isFile <- doesFileExist old let f = if isFile then renameFile else renameDirectory f old newrename :: FilePath -> FilePath -> IO () rename old new = do isFile <- doesFileExist old let f = if isFile then renameFile else renameDirectory f old new
The System.FilePath
module provides many useful functions for manipulating file
names. These functions mesh nicely with our
renameWith
and
namesMatching
functions, so that we can
quickly use them to create functions with complex behaviour. As
an example, this terse function changes the file name suffixing
convention for C++ source files.
cc2cpp name = mapM (renameWith (flip replaceExtension ".cpp")) =<< namesMatching "*.cc"cc2cpp name = mapM (renameWith (flip replaceExtension ".cpp")) =<< namesMatching "*.cc"
The cc2cpp
function uses a few
functions we'll be seeing over and over. The
mapM
function maps a function that can do
I/O over a list. The flip
function takes
another function as argument, and swaps the order of its
arguments (inspect the type of
replaceExtension
in ghci to see why).
The =<<
function feeds the result of
its right hand side as an argument to its left.