[Write about lazy function definitions. Bryan O'Sullivan **20070613074510] { hunk ./en/Makefile 13 + ch07/append.hs \ hunk ./en/ch07-globs-and-regexps.xml 4 - Case study: regular expressions and file name matching + Case study: regular expressions and file name + matching hunk ./en/ch07-globs-and-regexps.xml 52 - the text foo in a pattern will match - foo, and only foo, in an - input string. + the text foo in a pattern will match + foo, and only foo, in + an input string. hunk ./en/ch07-globs-and-regexps.xml 73 - character not in this - class. + character not in this + class. hunk ./en/ch07-globs-and-regexps.xml 78 - character within this set. + character within this set. hunk ./en/ch07-globs-and-regexps.xml 86 - character (\ on Unix-like + character (\ on Unix-like hunk ./en/ch07-globs-and-regexps.xml 123 - The only function that - we're likely to need for normal use is the regexp matching - function, an infix operator named =~ - (borrowed from Perl. The first hurdle to overcome is that - Haskell's regexp libraries make heavy use of polymorphism. As a - result, the type signature of the =~ - operator is a bit difficult to understand. Let's drop into - ghci and see what it looks like. + The only function that we're likely to need for normal use + is the regexp matching function, an infix operator named + =~ (borrowed from Perl. The first hurdle + to overcome is that Haskell's regexp libraries make heavy use of + polymorphism. As a result, the type signature of the + =~ operator is a bit difficult to + understand. Let's drop into ghci and see + what it looks like. hunk ./en/ch07-globs-and-regexps.xml 148 - what type of result we would like. In real code, it may be - able to infer the right type, due to the way we subsequently - use the result. But such cues are often lacking when we're - exploring with ghci. If we omit a specific - type for the result, we'll get an intimidating-looking error + what type of result we would like. In real code, it may be + able to infer the right type, due to the way we subsequently + use the result. But such cues are often lacking when we're + exploring with ghci. If we omit a specific + type for the result, we'll get an intimidating-looking error hunk ./en/ch07-globs-and-regexps.xml 164 - condition="tyvar">target type, we tell it what we'd - like the type to be. If we want an expression of type + condition="tyvar">target type, we tell it what + we'd like the type to be. If we want an expression of type hunk ./en/ch07-globs-and-regexps.xml 172 - describes how a target type should behave; the - base library defines many instances of this typeclass for us. - The Bool type is an instance of this typeclass, - so we get back a usable result. Another such instance is - Int, which gives us a count of the number of - times the regexp matches. + describes how a target + type should behave; the base library defines many instances of + this typeclass for us. The Bool type is an + instance of this typeclass, so we get back a usable result. + Another such instance is Int, which gives us a + count of the number of times the regexp matches. hunk ./en/ch07-globs-and-regexps.xml 182 - first substring that matches, or an empty string if nothing - matches. + first substring that matches, or an empty string if nothing + matches. hunk ./en/ch07-globs-and-regexps.xml 208 - We can obtain quite a lot of information about the - context in which a match occurs. If we ask for a - three-element tuple, we'll get back the text - before the first match, the text - of that match, and the text that - follows it. + We can obtain quite a lot of information about the context + in which a match occurs. If we ask for a three-element tuple, + we'll get back the text before the first + match, the text of that match, and the + text that follows it. hunk ./en/ch07-globs-and-regexps.xml 228 - We can get numeric information about matches, - too. A pair of Ints gives us the starting offset - of the first match, and its length. If we ask for a list of - these pairs, we'll get this information for all - matches. + We can get numeric information about matches, too. A + pair of Ints gives us the starting offset of the + first match, and its length. If we ask for a list of these + pairs, we'll get this information for all matches. hunk ./en/ch07-globs-and-regexps.xml 245 - role="module">Text.Regex.Base.Context + role="module">Text.Regex.Base.Context hunk ./en/ch07-globs-and-regexps.xml 261 - However, be aware that if you want a string - value in the result of a match, the text you're matching - against must be the same type of string. Let's see what this - means in practice. + However, be aware that if you want a string value in the + result of a match, the text you're matching against must be + the same type of string. Let's see what this means in + practice. hunk ./en/ch07-globs-and-regexps.xml 301 - role="module">Text.Regex.Posix. As its name + role="module">Text.Regex.Posix. As its name hunk ./en/ch07-globs-and-regexps.xml 309 - role="module">Text.Regex.Posix module are - different in some significant ways from Perl-style - regexps. Here are a few of the more notable differences. + role="module">Text.Regex.Posix module are + different in some significant ways from Perl-style regexps. + Here are a few of the more notable differences. hunk ./en/ch07-globs-and-regexps.xml 322 - POSIX regexps have less uniform syntax - than Perl-style regexps. They also lack a number of - capabilities provided by Perl-style regexps, such as - zero-width assertions and control over greedy - matching. + POSIX regexps have less uniform syntax than Perl-style + regexps. They also lack a number of capabilities provided + by Perl-style regexps, such as zero-width assertions and + control over greedy matching. hunk ./en/ch07-globs-and-regexps.xml 339 - Translating a glob pattern into a regular expression + Translating a glob pattern into a regular + expression hunk ./en/ch07-globs-and-regexps.xml 396 - hunk ./en/ch07-globs-and-regexps.xml 432 - they may need to be escaped, so that the regexp engine won't - treat them specially. + they may need to be escaped. + + &GlobRegex.hs:last; hunk ./en/ch07-globs-and-regexps.xml 436 - &GlobRegex.hs:rest; + The escape function ensures that the + regexp engine will not interpret certain characters as pieces of + regular expression syntax. + + &GlobRegex.hs:escape; hunk ./en/ch07-globs-and-regexps.xml 451 + + + Writing lazy functions + + In an imperative language, the + globToRegex function is one that we'd + usually express as a loop. For example, Python's standard + fnmatch module includes a function named + translate that does exactly the same job as + our globToRegex function. It's written as + a loop. + + If you've been exposed to functional programming through a + language such as Scheme or ML, you've probably had drilled into + your head the notion that the way to emulate a loop is + via tail recursion. A function + usually needs a little local scratch storage when it executes. + A tail recursive function must either return a plain value, or + make a recursive call as the last thing it does (these kinds of + call are called tail calls). Since a function + making a tail call can by definition never use any of its local + scratch storage again, a language implementation can reuse that + space when it makes the tail call. This means that a series of + tail calls can execute in constant space, just as we take for + granted with a loop. + + Looking at the globToRegex' function, + we can see that it is not tail recursive. + To see why, let's examine its final clause again (several of its + other clauses are structured similarly). + + &GlobRegex.hs:last; + + It calls itself recursively, but this + isn't the last thing the function does. + The result of the recursive call is a parameter to the + (++) function. + + So what's going on here? Why are we not writing a tail + recursive definition for this function? The answer lies with + Haskell's non-strict evaluation strategy. Before we start + talking about that, let's quickly talk about why, in a + traditional language, we'd be trying to avoid this kind of + recursive definition. + + Returning to the clause above, + globToRegex' computes a little bit of its + result, then calls itself recursively, then returns the complete + result. In a traditional language, each recursive call is going + to require the allocation of temporary scratch memory until it + returns. + + Image: stack allocation for a recursive call to + globToRegex'. + + Compound these scratch allocations over a large number of + recursive calls, and the amount of space we use while processing + grows linearly with the size of the list we must process. + + Image: stack allocation for a pile of recursive + calls. + + For a problem like glob-to-regexp conversion, where we'll + always be dealing with very small amounts of data, this overhead + is insignificant. But if we had a hundred million elements to + process, and used plain recursion rather than tail recursion or + a loop to process them, we'd have a severe problem. + + Haskell neatly defangs this problem by deferring the + evaluation of an expression until its result is needed. To + understand how this deferred evaluation works, let's walk + through what happens when the final clause of the + globToRegex' is called. (As you read, bear + in mind that this is a simplified description.) + + In order to determine that this clause must be called, the + pattern matcher must inspect a little bit of the list it was + passed. The pattern requires it to do nothing more than + determine whether it has been given an empty list (in which case + the clause will not be evaluated), or a non-empty list (the case + we're interested in). Once the pattern matcher has established + that the list isn't empty, it goes no deeper. It doesn't look + inside c, the head of the list. It doesn't + follow cs, the tail. + + We have now decided to evaluate the right hand side of + the clause. The first expression we must evaluate is the call + to the list append function, (++). + (Because we don't need the results of escape c or + globToRegex' cs yet, we suspend the evaluation of + those functions, to evaluate when we'll need them.) So let's + remind ourselves what (++) looks + like: + + &append.hs:append; + + As the argument to (++) is not an + empty list, the second clause matches, so we must evaluate + x. This requires that we evaluate + escape c, which we had earlier deferred. Let's + look again at the definition of + escape. + + &GlobRegex.hs:escape; + + This will return a list whose first element is either the + \ character or the character it was passed. + Once we've constructed the first element of this list, we'll + defer constructing the rest of it until our caller needs it. We + suspend the evaluation of the rest of + escape so that we can pass this datum back + to our caller. Haskell implementations do this by saving away + enough information to resume the evaluation later; this saved + package of information is called a thunk. + + Not only is the evaluation of escape + suspended; so too are the evaluation of its callers, + (++) and globToRegex', + and so on up the call chain until some expression needs to + inspect the value. + + What would happen if the caller of + globToRegex' was for some reason the + function null? null + only cares whether its argument is an empty list: once it finds + this out, it's never going to look at the remainder of the + result of globToRegex'. None of the work + we deferred in order to get a partial result back to + null will ever actually occur. + + Remarkably, given this kind of evaluation strategy, our + recursive definition of globToRegex' will + execute in constant space (more or less). To see why, look at + the definition of (++); although it's + simpler, it has a similar recursive structure. + + If we put on our implementor's hats and think about how we + might suspend the evaluation of (++) in a + thunk, all we need to store are the current heads of the left + and right lists; the result of the next evaluation step depends + on nothing more. On each resumption of the thunk, we could + simply peel off the head of the left list and update the thunk + with the new head, repeating until we reach its end. Therefore, + we can execute in constant space. + addfile ./examples/ch07/append.hs hunk ./examples/ch07/append.hs 1 +{-# OPTIONS_GHC -fno-implicit-prelude #-} + +{-- snippet append --} +(++) :: [a] -> [a] -> [a] +[] ++ ys = ys +(x:xs) ++ ys = x : (xs ++ ys) +{-- /snippet append --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 37 -{-- snippet rest --} +{-- snippet last --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 39 +{-- /snippet last --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 41 +{-- snippet escape --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 47 -{-- /snippet rest --} +{-- /snippet escape --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 57 -name `matchesPattern` pat = name =~ (globToRegex pat) +name `matchesPattern` pat = name =~ globToRegex pat }