[ch09 done Bryan O'Sullivan **20080804083033] { addfile ./examples/ch09/prices.csv addfile ./examples/ch09/HighestClose.hs addfile ./examples/ch09/filepath.ghci addfile ./examples/ch09/SumFile.hs hunk ./en/ch09-globs-and-regexps.xml 4 - Case study: regular expressions and file name - matching + Efficient file processing, regular expressions, and file + name matching hunk ./en/ch09-globs-and-regexps.xml 7 - Working with text strings is a fundamental tool in the - programmer's toolbox. We've already been introduced to - bytestrings and ropes as efficient ways to work with large text - strings. Throughout this book, we'll be returning to string - manipulation over and again, to show off the breadth of different - techniques and libraries we can use to tackle a variety of - problems that happen to involve working with strings. + + Efficient file processing hunk ./en/ch09-globs-and-regexps.xml 10 - In this chapter, we'll develop a library that can match file - names against glob-style file name patterns. These - patterns are a common feature of command shells. To begin, we'll - introduce the pattern language that we'll be working with. Our - library will translate these patterns into regular expressions, so - we'll need to understand how to use regular expressions in - Haskell. Next, we'll talk about how to use Haskell's standard - directory listing functions with this library. Along the way, - we'll talk a little about writing portable code, good API design, - and constructing code from small building blocks. + This simple microbenchmark reads a text file full of + numbers, and prints their sum. + + &SumFile.hs:main; + + Although the String type is the default used + for reading and writing files, it is not efficient, so a simple + program like this will perform badly. + + A String is represented as a list of + Char values; each element of a list is allocated + individually, and has some book-keeping overhead. These factors + affect the memory consumption and performance of a program that + must read or write text or binary data. On simple benchmarks + like this, even programs written in interpreted languages such + as Python can outperform Haskell code that uses + String by an order of magnitude. + + The bytestring library provides a fast, cheap + alternative to the String type. Code written with + bytestring can often match or exceed the + performance and memory footprint of C, while maintaining + Haskell's expressivity and conciseness. + + The library supplies two modules. Each defines functions + that are nearly drop-in replacements for their + String counterparts. + + + + The Data.ByteString module defines a + strict type named + ByteString. This represents a string of binary + or text data in a single array. + + + + The Data.ByteString.Lazy module provides a + lazy type, also named + ByteString. This represents a string of data + as a list of chunks, arrays of up to + 64KB in size. + + + + Each ByteString type performs better under + particular circumstances. For streaming a large quantity + (hundreds of megabytes to terabytes) of data, the lazy + ByteString type is usually best. Its chunk size is + tuned to be friendly to a modern CPU's L1 cache, and a garbage + collector can quickly discard chunks of streamed data that are + no longer being used. + + The strict ByteString type performs best for + applications that are less concerned with memory footprint, or + that need to access data randomly. + + + Binary I/O and qualified imports + + Let's develop a small function to illustrate some of the + ByteString API. We will determine if a file is + an ELF object file: this is the format used for executables on + almost all modern Unix-like systems. + + This is a simple matter of looking at the first four bytes + in the file, and seeing if they match a specific sequence of + bytes. A byte sequence that identifies a file's type is often + known as a magic number. + + &ElfMagic.hs:hasElfMagic; + + We import the ByteString modules using + Haskell's qualified import syntax, the + import qualified that we see above. This lets us + refer to a module with a name of our choosing. + + For instance, when we want to refer to the lazy + ByteString module's take + function, we must write L.take, since we + imported the module under the name L. If we + are not explicit about which version of e.g. + take we want, the compiler will report an + error. + + We will always use qualified import syntax with the + ByteString modules, because they provide many + functions that have the same names as Prelude + functions. + + + Qualified imports make it easy to switch between + ByteString types. All you should need to do is + modify an import declaration at the top of your + source file; the rest of your code will probably not need + any changes. You can thus handily benchmark the two types, + to see which is best suited to your application's needs + + + Whether or not we use qualified imports, we can always use + the entire name of a module to identify something + unambiguously. For instance, both + Data.ByteString.Lazy.length and + L.length identify the same function, as + do Prelude.sum and + sum. + + The lazy and strict ByteString modules are + intended for binary I/O. The Haskell data type for + representing bytes is Word8; if we need to refer + to it by name, we import it from the Data.Word + module. + + The L.pack function takes a list of + Word8 values, and packs them into a lazy + ByteString. (The L.unpack + function performs the reverse conversion.) Our + hasElfMagic function simply compares the + first four bytes of a ByteString against a magic + number. + + We are writing in classic Haskell style, where our + hasElfMagic function does not perform + I/O. Here is the function that uses it on a file. + + &ElfMagic.hs:isElfFile; + + The L.readFile function is the lazy + ByteString equivalent of + readFile. It operates lazily, reading + the file as data is demanded. It is also efficient, reading + chunks of up to 64KB at once. The lazy + ByteString is a good choice for our task: since + we only need to read at most the first four bytes of the file, + we can safely use this function on a file of any size. + + + + Text I/O + + For convenience, the bytestring library + provides two other modules with limited text I/O capabilities, + Data.ByteString.Char8 and + Data.ByteString.Lazy.Char8. These expose + individual string elements as Char instead of + Word8. + + + The functions in these modules only work with byte-sized + Char values, so they are only suitable for use + with ASCII and some European character sets. Values above + 255 are truncated. + + + The character-oriented bytestring modules + provide useful functions for text processing. Here is a file + that contains monthly stock prices for a well-known Internet + company from mid-2008. + + &highestClose.ghci:readFile; + + How can we find the highest closing price from a series of + entries like this? Closing prices are in the fourth + comma-separated column. This function obtains a closing price + from one line of data. + + &HighestClose.hs:closing; + + Since this function is written in point-free style, we + read from right to left. The L.split + function splits a lazy ByteString into a list of + them, every time it finds a matching character. The + (!!) operator retrieves the + kth element of a list. Our + readPrice function turns a string + representing a fractional price into a whole number. + + &HighestClose.hs:readPrice; + + We use the L.readInt function, which + parses an integer. It returns both the integer and the + remainder of the string once a run of digits is consumed. Our + definition is slightly complicated by + L.readInt returning Nothing + if parsing fails. + + Our function for finding the highest closing price is + straightforward. + + &HighestClose.hs:highestClose; + + We use one trick to work around the fact that we cannot + supply an empty list to the maximum + function. + + &highestClose.ghci:maximum; + + Since we do not want our code to throw an exception if we + have no stock data, the (Nothing:) expression + ensures that the list of Maybe Int values that we + supply to maximum will never be + empty. + + &highestClose.ghci:maxList; + + Does our function work? + + &highestClose.ghci:highestClose; + + Since we have separated our I/O from our logic, we can + test the no-data case without having to create an empty + file. + + &highestClose.ghci:highestEmpty; + + + hunk ./en/ch09-globs-and-regexps.xml 231 - Many system-oriented programming languages + Many systems-oriented programming languages hunk ./en/ch09-globs-and-regexps.xml 234 - pattern. (In other languages, this function is often named + pattern. In other languages, this function is often named hunk ./en/ch09-globs-and-regexps.xml 236 - library generally has good system programming + library generally has good systems programming hunk ./en/ch09-globs-and-regexps.xml 241 - The kinds of patterns we'll be dealing with are commonly - referred to as glob patterns (the term we'll - use), wild card patterns, or shell-style - patterns. These patterns have just a few simple - rules. You probably already know them, but we'll quickly recap - here. + The kinds of patterns we'll be dealing with are + commonly referred to as glob patterns (the + term we'll use), wild card patterns, or shell-style patterns. + They have just a few simple rules. You probably already know + them, but we'll quickly recap here. hunk ./en/ch09-globs-and-regexps.xml 259 - The * (asterisk) character means - match anything; it will match any text, - including the empty string. + The * (asterisk) character + means match anything; it will match any text, + including the empty string. For instance, the pattern + foo* will match any string that begins with + foo, such as foo itself, + foobar, or foo.c. The pattern + quux*.c will match any string that begins with + quux and ends in .c, such as + quuxbaz.c. hunk ./en/ch09-globs-and-regexps.xml 270 - The ? (question mark) character means - match any single character. + The ? (question mark) + character matches any single character. The pattern + pic??.jpg will match names like + picaa.jpg or pic01.jpg. hunk ./en/ch09-globs-and-regexps.xml 289 - Character classes have an added subtlety; they can't be - empty. The first character after the opening + Character classes have an added subtlety; they + can't be empty. The first character after the opening hunk ./en/ch09-globs-and-regexps.xml 294 - []aeiou]. + []aeiou]. The pattern + pic[0-9].[pP][nN][gG] will match a name + consisting of the string pic, followed by a + single digit, followed by any capitalization of the strig + .png. hunk ./en/ch09-globs-and-regexps.xml 302 - While Haskell doesn't provide a way to match glob patterns - among its standard libraries, it has excellent regular - expression matching facilities. Glob patterns are nothing more - than cut-down regular expressions with slightly different + While Haskell doesn't provide a way to match glob + patterns among its standard libraries, it provides a good + regular expression matching library. Glob patterns are nothing + more than cut-down regular expressions with slightly different hunk ./en/ch09-globs-and-regexps.xml 307 - expressions, so that's what we'll do. But in order to do so, we - must first understand how to use regular expressions in - Haskell. + expressions, but to do so, we must first understand how to use + regular expressions in Haskell. hunk ./en/ch09-globs-and-regexps.xml 314 - In this section, I'll be assuming that you are already - familiar with regular expressions (which I'll abbreviate as - regexps from here on) by way of some other - language, such as Python, Perl, or Java. Rather than introduce - them as something new, I'll be focusing on what's different - about regexp handling in Haskell, compared to other languages. - Haskell's regular expression matching libraries are a lot more - expressive than those of other languages, so there's plenty to - talk about. + In this section, we will be assume that you are + already familiar with regular expressions by way of some other + language, such as Python, Perl, or Java + If you are not acquainted with regular expressions, we + recommend Jeffrey Friedl's book Mastering Regular + Expressions. + . + + For brevity, we will abbreviate regular + expression as regexp from here + on. + + + Rather than introduce regexps as something new, we will + focus on what's different about regexp handling in Haskell, + compared to other languages. Haskell's regular expression + matching libraries are a lot more expressive than those of other + languages, so there's plenty to talk about. hunk ./en/ch09-globs-and-regexps.xml 333 - To begin our exploration of the regexp libraries, the only - module we'll need to work with is To begin our exploration of the regexp libraries, + the only module we'll need to work with is The only function that we're likely to need for normal use - is the regexp matching function, an infix operator named - =~ (borrowed from Perl. The first hurdle - to overcome is that Haskell's regexp libraries make heavy use of - polymorphism. As a result, the type signature of the - =~ operator is a bit difficult to - understand. Let's drop into ghci and see - what it looks like. + The only function that we're likely to need for + normal use is the regexp matching function, an infix operator + named (=~) (borrowed from Perl). The first + hurdle to overcome is that Haskell's regexp libraries make heavy + use of polymorphism. As a result, the type signature of the + (=~) operator is difficult to understand, + so we will not explain it here. hunk ./en/ch09-globs-and-regexps.xml 349 - ®exp.ghci:typetwiddle; - - Ouch! Rather than pick apart what this convoluted - definition means, let's take a shortcut. The + The hunk ./en/ch09-globs-and-regexps.xml 355 - Data.ByteString as either argument. + ByteString as either argument. hunk ./en/ch09-globs-and-regexps.xml 360 - The =~ operator is polymorphic in its - return type, so the Haskell compiler needs some way to know - what type of result we would like. In real code, it may be - able to infer the right type, due to the way we subsequently - use the result. But such cues are often lacking when we're - exploring with ghci. If we omit a specific - type for the result, we'll get an intimidating-looking error - message. Here's what ghci tells us if we - try to do a regexp match but omit the return type. - - ®exp.ghci:noreturn; + The =~ operator is + polymorphic in its return type, so the Haskell compiler needs + some way to know what type of result we would like. In real + code, it may be able to infer the right type, due to the way + we subsequently use the result. But such cues are often + lacking when we're exploring with ghci. If + we omit a specific type for the result, we'll get an + error from the interpreter, as it does not have enough + information to successfuly infer the result type. hunk ./en/ch09-globs-and-regexps.xml 370 - That's quite an impenetrable error message, but what it - means is that ghci can't infer what type of - result to give back to us. (This is the same kind of error - that a compiler would print if it couldn't infer the correct - type in a real program.) Here's the key to dealing with this - error. When ghci can't infer the target type, we tell it what - we'd like the type to be. If we want an expression of type - Bool, we'll get a pass/fail result. + When ghci can't infer the + target type, we tell it + what we'd like the type to be. If we want a result of + type Bool, we'll get a pass/fail answer. hunk ./en/ch09-globs-and-regexps.xml 377 - In the bowels of the base regexp library, there's a - typeclass named RegexContext that - describes how a target - type should behave; the base library defines many instances of - this typeclass for us. The Bool type is an - instance of this typeclass, so we get back a usable result. - Another such instance is Int, which gives us a - count of the number of times the regexp matches. + In the bowels of the regexp libraries, + there's a typeclass named RegexContext + that describes how a target type should behave; the + base library defines many instances of this typeclass for us. + The Bool type is an instance of this typeclass, + so we get back a usable result. Another such instance is + Int, which gives us a count of the number of + times the regexp matches. hunk ./en/ch09-globs-and-regexps.xml 389 - If we ask for a String result, we'll get the - first substring that matches, or an empty string if nothing - matches. + If we ask for a String result, + we'll get the first substring that matches, or an empty string + if nothing matches. hunk ./en/ch09-globs-and-regexps.xml 395 - Another valid type of result is [String], - which returns a list of all matching - strings. + Another valid type of result is + [String], which returns a list of + all matching strings. hunk ./en/ch09-globs-and-regexps.xml 402 - If you want a result that's a plain String, - beware. Since (=~) returns an empty - string to signify no match, this poses an - obvious difficulty if the empty string could also be a valid - match for the regexp. If such a case arises, you should use - a different return type instead. - - FIXME: Find some place to mention the monadic - =~~ operator. + If you want a result that's a plain + String, beware. Since + (=~) returns an empty string to signify + no match, this poses an obvious difficulty if + the empty string could also be a valid match for the regexp. + If such a case arises, you should use a different return + type instead, such as [String]. hunk ./en/ch09-globs-and-regexps.xml 411 - That's about it for simple result types, - but we're not by any means finished. Before we continue, - let's use a single pattern for our remaining examples. We can - define this pattern as a variable in &ghci;, to save a little - typing. + That's about it for simple result + types, but we're not by any means finished. Before we + continue, let's use a single pattern for our remaining + examples. We can define this pattern as a variable in &ghci;, + to save a little typing. hunk ./en/ch09-globs-and-regexps.xml 419 - We can obtain quite a lot of information about the context - in which a match occurs. If we ask for a three-element tuple, - we'll get back the text before the first - match, the text of that match, and the - text that follows it. + We can obtain quite a lot of information about + the context in which a match occurs. If we ask for a + (String, String, String) tuple, we'll get back the text + before the first match, the text + of that match, and the text that + follows it. hunk ./en/ch09-globs-and-regexps.xml 428 - If the match fails, the entire text is returned as the - before element of the tuple, with the other two - elements left empty. + If the match fails, the entire text is returned + as the before element of the tuple, with the + other two elements left empty. hunk ./en/ch09-globs-and-regexps.xml 434 - Asking for a four-element tuple gives us a fourth element - that's a list of all groups in the pattern that + Asking for a four-element tuple gives us a + fourth element that's a list of all groups in the pattern that hunk ./en/ch09-globs-and-regexps.xml 440 - We can get numeric information about matches, too. A - pair of Ints gives us the starting offset of the - first match, and its length. If we ask for a list of these - pairs, we'll get this information for all matches. + We can get numeric information about matches, + too. A pair of Ints gives us the starting offset + of the first match, and its length. If we ask for a list of + these pairs, we'll get this information for all + matches. hunk ./en/ch09-globs-and-regexps.xml 455 - Believe it or not, this is not a - comprehensive list of built-in instances of the - RegexContext typeclass. For a complete - list, see the documentation for the Text.Regex.Base.Context + This is not a comprehensive list of built-in + instances of the RegexContext + typeclass. For a complete list, see the documentation for the + Text.Regex.Base.Context hunk ./en/ch09-globs-and-regexps.xml 461 + This ability to make a function polymorphic in its result + type is an unusual feature for a statically typed language. + hunk ./en/ch09-globs-and-regexps.xml 465 + + + + More about regular expressions hunk ./en/ch09-globs-and-regexps.xml 473 - As we noted earlier, the =~ operator - uses typeclasses for its argument types and its return type. - Any combination of String and - ByteString will work for the types of both the - regular expression and the text to match against. Recall, - from our coverage of bytestrings in chapter - FIXME, that the pack - function takes a String and returns its - corresponding ByteString. + As we noted earlier, the =~ + operator uses typeclasses for its argument types and its + return type. We can use either String or + strict ByteString values for both the + regular expression and the text to match against. hunk ./en/ch09-globs-and-regexps.xml 486 - However, we need to be aware that if we want a string - value in the result of a match, the text we're matching + However, we need to be aware that if we want a + string value in the result of a match, the text we're matching hunk ./en/ch09-globs-and-regexps.xml 495 - ByteString. The type checker accepts this because - ByteString appears in the result type. But if we - try getting a String out, that + ByteString. The type checker accepts this + because ByteString appears in the result type. + But if we try getting a String out, that hunk ./en/ch09-globs-and-regexps.xml 502 - And ouch! That's quite an error message. Don't let its - verbosity confound you; we can easily fix this problem by - making the string types of the left hand side and the result - match once again. + We can easily fix this problem by making the + string types of the left hand side and the result match once + again. hunk ./en/ch09-globs-and-regexps.xml 508 - This restriction does not apply to - the type of the regexp we're matching against. It can be - either a String or ByteString, - unconstrained by the other types in use. + This restriction does not + apply to the type of the regexp we're matching against. It + can be either a String or + ByteString, unconstrained by the other types in + use. hunk ./en/ch09-globs-and-regexps.xml 518 - When you look through Haskell library documentation, - you'll see several regexp-related modules. The modules under - Text.Regex.Base define - the common API adhered to by all of the other regexp modules. - It's possible to have multiple implementations of the regexp - API installed at one time. At the time of writing, + When you look through Haskell library + documentation, you'll see several regexp-related modules. The + modules under Text.Regex.Base define the common + API adhered to by all of the other regexp modules. It's + possible to have multiple implementations of the regexp API + installed at one time. At the time of writing, hunk ./en/ch09-globs-and-regexps.xml 531 - If you're coming to Haskell from a language like Perl, - Python, or Java, and you've used regular expressions in one - of those languages, you should be aware that the POSIX - regexps handled by the If you're coming to Haskell from a language + like Perl, Python, or Java, and you've used regular + expressions in one of those languages, you should be aware + that the POSIX regexps handled by the Perl regexp engines do left-biased matching when - matching alternatives, whereas POSIX engines choose the + Perl regexp engines perform left-biased matching + when matching alternatives, whereas POSIX engines choose the hunk ./en/ch09-globs-and-regexps.xml 548 - POSIX regexps have less uniform syntax than Perl-style - regexps. They also lack a number of capabilities provided - by Perl-style regexps, such as zero-width assertions and - control over greedy matching. + POSIX regexps have less uniform syntax than + Perl-style regexps. They also lack a number of capabilities + provided by Perl-style regexps, such as zero-width + assertions and control over greedy matching. hunk ./en/ch09-globs-and-regexps.xml 554 - Other Haskell regexp packages are available for download - on the Internet. Some provide faster engines than the current - POSIX engine; others provide Perl-style matching capabilities. - All follow the standard API that we have covered in this + Other Haskell regexp packages are available for + download from Hackage. Some provide better performance than + the current POSIX engine (e.g. regex-tdfa); + others provide the Perl-style matching that most programmers + are now familiar with (e.g. regex-pcre). All + follow the standard API that we have covered in this hunk ./en/ch09-globs-and-regexps.xml 579 - We start our definition of the - globToRegex function by recalling that a - text string must match a glob pattern. Before we attempt to - convert any part of the glob pattern, we need to have a - rooted regular expression. + The regular expression that we generate must be + anchored, so that it starts matching from + the beginning of a string and finishes at the end. hunk ./en/ch09-globs-and-regexps.xml 592 - That single-quote character in the name - globToRegex'is not a typo, by the way; - Haskell is unusual in allowing single quotes as parts of - identifiers. A single quote is most likely to appear at the end - of a name, where it's often pronounced prime. (We - could, if we wanted to, name a function - head'n'tail'n'such, but that's pretty - unusual.) - hunk ./en/ch09-globs-and-regexps.xml 598 - This grants us the flexibility to structure your code in the - manner that makes most logical sense to us, as opposed to a - way that makes the compiler writer's life easiest. + This grants us the flexibility to structure our code in the + manner that makes most logical sense to us, rather than follow + an order that makes the compiler writer's life easiest. hunk ./en/ch09-globs-and-regexps.xml 602 - Haskell module writers often use this flexibility to put - more important code earlier in a source file, - relegating plumbing to later. In fact, that's - exactly how we're presenting the + Haskell module writers often use this + flexibility to put more important code earlier + in a source file, relegating plumbing to later. + This is exactly how we are presenting the hunk ./en/ch09-globs-and-regexps.xml 618 - We now have a very minimal glob translator. Our first + Our first hunk ./en/ch09-globs-and-regexps.xml 622 - match end-of-line. The second gets us to - substitute the string .* every time we see a - * in our input list. And the third passes - every other character through, unaltered. - - We can immediately save our new source file (let's call it - GlobRegexTiny.hs and start playing with it - in ghci. This is a great way to do - exploratory programming: write a little code, load it into the - interpreter, and see what happens. - - &glob-regexp.ghci:tiny; - - These few lines of interaction might look trivial, but we - have received immediate feedback on two fronts: our code passes - the daunting scrutiny of Haskell's type checker, and it produces - sensible-looking results. Monkeying around with code in the - interpreter early and often is consistently worthwhile. - - The remaining cases for a complete definition of - globToRegex' are easy to enumerate. First, - we get ? out of the way. - - &GlobRegex.hs:question; - - Should we revisit the different types of the - (++) and (:) operators - here? Newcomers often get confused over which to use. - - More interesting is how we handle character classes. - - &GlobRegex.hs:class; - - We take advantage of two behaviours of Haskell's pattern - matcher here. The first is that it will match patterns in the - order in which we declare them, so we place the most specific - patterns first, and the least specific last. Secondly, there's - no limit to the depth of a pattern: we can - peek forwards into a list as far as we need to. - (This isn't limited to simple lists, either. We can use the - same capability to look inside nested structures.) - - Here, the first pattern matches on three consecutive items - of its input to ensure that a negative character - class cannot be empty. The second pattern matches on only two - items, so ensure that a positive character class - cannot be empty. The final clause can only match if it's given - a syntactically invalid character class. - - Finally, we may need to escape some characters before we - return them. - - &GlobRegex.hs:last; + match end-of-line. Following this is a series of + clauses that switch our pattern from glob syntax to regexp syntax. + The last clause passes every other character through, possibly + escaping it first. hunk ./en/ch09-globs-and-regexps.xml 633 - The charClass helper function does - nothing more than ensure that a character class is correctly - terminated. It passes its input through unmolested until it - hits a ], when it hands control back to + The charClass helper function + only checks that a character class is correctly terminated. It + passes its input through unmodified until it hits a + ], when it hands control back to hunk ./en/ch09-globs-and-regexps.xml 652 - It works! Now let's play around a little with &ghci;. We - can create a temporary definition for - fnmatch and try it out. (This is another - example of how &ghci; is great for exploratory programming. - We're not limited to defining temporary variables; we can also - introduce temporary functions when we need to.) + It works! Now let's play around a little with + &ghci;. We can create a temporary definition for + fnmatch and try it out. hunk ./en/ch09-globs-and-regexps.xml 658 - The name fnmatch doesn't - really have the Haskell nature, though. The - typical Haskell style is for functions to have descriptive, - camel cased names. Camel casing name smashes - words together, capitalising each word in the resulting name; - the name comes from the humps introduced by the - capital letters. In our library, we'll give this function the - name matchesGlob. + The name fnmatch doesn't + really have the Haskell nature, though. By far + the most common Haskell style is for functions to have + descriptive, camel cased names. Camel casing + concatenates words, capitalising all but possibly the first + word. For instance, the words file name matches + would become the name fileNameMatches. The name + camel case comes from the humps + introduced by the capital letters. In our library, we'll give + this function the name matchesGlob. hunk ./en/ch09-globs-and-regexps.xml 671 - - The choice of naming conventions and coding styles is a - perennial source of heated discussion, perhaps because the - stakes are so low that everyone feels entitled to an opinion. - When we introduce style-related notions like camel cased - identifiers, we're showing you what Haskell convention looks - like. We're not making a statement about the relative merits - of different people's preferences. All we - do claim is that it's better to fit in - with an existing house style than to forge off - in a novel stylistic direction. - - + You may have noticed that most of the names that we have + used for variables so far have been short. As a rule of thumb, + descriptive variable names are more useful in longer function + definitions, as they aid readability. For a two-line function, + a long variable name has less value. hunk ./en/ch09-globs-and-regexps.xml 717 - If you've been exposed to functional programming through a - language such as Scheme or ML, you've probably had drilled into - your head the notion that the way to emulate a loop is - via tail recursion. A function - usually needs a little local scratch storage when it executes. - A tail recursive function must either return a plain value, or - make a recursive call as the last thing it does (these kinds of - call are called tail calls). Since a function - making a tail call can by definition never use any of its local - scratch storage again, a language implementation can reuse that - space when it makes the tail call. This means that a series of - tail calls can execute in constant space, just as we take for - granted with a loop. - - Looking at the globToRegex' function, - we can see that it is not tail recursive. - To see why, let's examine its final clause again (several of its - other clauses are structured similarly). - - &GlobRegex.hs:last.noid; + If you've been exposed to functional programming + through a language such as Scheme or ML, you've probably had + drilled into your head the notion that the way to emulate + a loop is via tail recursion. hunk ./en/ch09-globs-and-regexps.xml 722 - It calls itself recursively, but this - isn't the last thing the function does. - The result of the recursive call is a parameter to the - (++) function. + Looking at the globToRegex' + function, we can see that it is not tail + recursive. To see why, examine its final clause again + (several of its other clauses are structured similarly). hunk ./en/ch09-globs-and-regexps.xml 727 - So what's going on here? Why are we not writing a tail - recursive definition for this function? The answer lies with - Haskell's non-strict evaluation strategy. Before we start - talking about that, let's quickly talk about why, in a - traditional language, we'd be trying to avoid this kind of - recursive definition. - - Returning to the clause above, - globToRegex' computes a little bit of its - result, then calls itself recursively, then returns the complete - result. In a traditional language, each recursive call is going - to require the allocation of temporary scratch memory until it - returns. - - Image: stack allocation for a recursive call to - globToRegex'. - - Compound these scratch allocations over a large number of - recursive calls, and the amount of space we use while processing - grows linearly with the size of the list we must process. - - Image: stack allocation for a pile of recursive - calls. + &GlobRegex.hs:last; hunk ./en/ch09-globs-and-regexps.xml 729 - For a problem like glob-to-regexp conversion, where we'll - always be dealing with very small amounts of data, this overhead - is insignificant. But if we had a hundred million elements to - process, and used plain recursion rather than tail recursion or - a loop to process them, we'd have a severe problem. + It applies itself recursively, and the result of + the recursive application is used as a parameter to the + (++) function. Since the recursive + application isn't the last thing the + function does, globToRegex' is not tail + recursive. hunk ./en/ch09-globs-and-regexps.xml 736 - Haskell neatly defangs this problem by deferring the - evaluation of an expression until its result is needed. To - understand how this deferred evaluation works, let's walk - through what happens when the final clause of the - globToRegex' is called. (As you read, bear - in mind that this is a simplified description.) + Why is our definition of this function not tail + recursive? The answer lies with Haskell's non-strict evaluation + strategy. Before we start talking about that, let's quickly + talk about why, in a traditional language, we'd try to + avoid this kind of recursive definition. Here is a simpler + definition, of the (++) operator. It is + recursivem, but not tail recursive. hunk ./en/ch09-globs-and-regexps.xml 744 - In order to determine that this clause must be called, the - pattern matcher must inspect a little bit of the list it was - passed. The pattern requires it to do nothing more than - determine whether it has been given an empty list (in which case - the clause will not be evaluated), or a non-empty list (the case - we're interested in). Once the pattern matcher has established - that the list isn't empty, it goes no deeper. It doesn't look - inside c, the head of the list. It doesn't - follow cs, the tail. + &append.hs:append; hunk ./en/ch09-globs-and-regexps.xml 746 - We have now decided to evaluate the right hand side of - the clause. The first expression we must evaluate is the call - to the list append function, (++). - (Because we don't need the results of escape c or - globToRegex' cs yet, we suspend the evaluation of - those functions, to evaluate when we'll need them.) So let's - remind ourselves what (++) looks - like: + In a strict language, if we evaluate "foo" ++ + "bar", the entire list is constructed, then returned. + Non-strict evaluation defers much of the work until it is + needed. hunk ./en/ch09-globs-and-regexps.xml 751 - &append.hs:append; + If we demand an element of the expression + "foo" ++ "bar", the first pattern of the function's + definition matches, and we return the expression x : (xs + ++ ys). Because the (:) constructor is + non-strict, the evaluation of xs ++ ys can be + deferred: we generate more elements of the result at whatever + rate they are demanded. When we generate more of the result, we + will no longer be using x, so the garbage + collector can reclaim it. Since we generate elements of the + result on demand, and do not hold onto parts that we are done + with, the compiler can evaluate our code in constant + space. + hunk ./en/ch09-globs-and-regexps.xml 765 - As the argument to (++) is not an - empty list, the second clause matches, so we must evaluate - x. This requires that we evaluate - escape c, which we had earlier deferred. Let's - look again at the definition of - escape. + + Making use of our pattern matcher hunk ./en/ch09-globs-and-regexps.xml 768 - &GlobRegex.hs:escape.noid; + It's all very well to have a function that can + match glob patterns, but we'd like to be able to put this to + practical use. On Unix-like systems, the + glob function returns the names of all + files and directories that match a given glob pattern. Let's + build a similar function in Haskell. Following the Haskell norm + of descriptive naming, we'll call our function + namesMatching. hunk ./en/ch09-globs-and-regexps.xml 777 - This will return a list whose first element is either the - \ character or the character it was passed. - Once we've constructed the first element of this list, we'll - defer constructing the rest of it until our caller needs it. We - suspend the evaluation of the rest of - escape so that we can pass this datum back - to our caller. Haskell implementations do this by saving away - enough information to resume the evaluation later; this saved - package of information is called a thunk. + &Glob.hs:module; hunk ./en/ch09-globs-and-regexps.xml 779 - Not only is the evaluation of escape - suspended; so too are the evaluation of its callers, - (++) and globToRegex', - and so on up the call chain until some expression needs to - inspect the value. + We specify that namesMatching is the + only name that users of our Glob module will be + able to see. hunk ./en/ch09-globs-and-regexps.xml 783 - What would happen if the caller of - globToRegex' was for some reason the - function null? null - only cares whether its argument is an empty list: once it finds - this out, it's never going to look at the remainder of the - result of globToRegex'. None of the work - we deferred in order to get a partial result back to - null will ever actually occur. + This function will obviously have to manipulate + filesystem paths a lot, splicing and joining them as it goes. + We'll need to use a few previously unfamiliar modules along the + way. hunk ./en/ch09-globs-and-regexps.xml 788 - Remarkably, given this kind of evaluation strategy, our - recursive definition of globToRegex' will - execute in constant space (more or less). To see why, look at - the definition of (++); although it's - simpler, it has a similar recursive structure. + The System.Directory module provides + standard functions for working with directories and their + contents. hunk ./en/ch09-globs-and-regexps.xml 793 - If we put on our implementer's hats and think about how we - might suspend the evaluation of (++) in a - thunk, all we need to store are the current heads of the left - and right lists; the result of the next evaluation step depends - on nothing more. On each resumption of the thunk, we could - simply peel off the head of the left list and update the thunk - with the new head, repeating until we reach its end. Therefore, - we can execute in constant space. - + &Glob.hs:import.directory; hunk ./en/ch09-globs-and-regexps.xml 795 - - Making use of our pattern matcher + The System.FilePath module abstracts the + details of an operating system's path name conventions. The + (</>) function joins two path + components. hunk ./en/ch09-globs-and-regexps.xml 801 - It's all very well to have a function that can match glob - patterns, but we'd like to be able to put this to practical use. - On Unix-like systems, the glob function - returns the names of all files and directories that match a - given glob pattern. Let's build a similar function in Haskell. - Following the Haskell norm of descriptive naming, we'll call our - function namesMatching. + &filepath.ghci:join; hunk ./en/ch09-globs-and-regexps.xml 803 - &Glob.hs:module; + The name of the + dropTrailingPathSeparator function is + perfectly descriptive. hunk ./en/ch09-globs-and-regexps.xml 807 - This function will obviously have to manipulate filesystem - paths a lot, splicing and joining them as it goes. We'll need - to use a few previously unfamiliar modules along the way. + &filepath.ghci:dropTrailingPathSeparator; hunk ./en/ch09-globs-and-regexps.xml 809 - The System.Directory - module provides standard functions for working with directories - and their contents. + The splitFileName function splits a + path at the last slash. hunk ./en/ch09-globs-and-regexps.xml 812 - &Glob.hs:import.directory; + &filepath.ghci:splitFileName; hunk ./en/ch09-globs-and-regexps.xml 814 - The System.FilePath - module abstracts the details of an operating system's path name - conventions. Using this together with the System.Directory module, we can - write a portable namesMatching function + Using System.FilePath together with the + System.Directory module, we + can write a portable namesMatching function hunk ./en/ch09-globs-and-regexps.xml 821 - Finally, we'll be emulating a for loop; - getting our first taste of exception handling in Haskell; and of - course using the matchesGlob function we - just wrote. + In this module, we'll be emulating a + for loop; getting our first taste of exception + handling in Haskell; and of course using the + matchesGlob function we just wrote. hunk ./en/ch09-globs-and-regexps.xml 828 - Since directories and files live in the real - world of activities that have effects, our globbing - function will have to have IO in its - result type. - - &Glob.hs:type; + Since directories and files live in the + real world of activities that have effects, our + globbing function will have to have IO in + its result type. hunk ./en/ch09-globs-and-regexps.xml 833 - If the string we're passed doesn't contain any pattern - characters, all we need to do is check that the given name - exists in the filesystem. (Notice that we use Haskell's - function guard syntax here to write a nice tidy definition. An - if would do, but isn't as aesthetically - pleasing.) + If the string we're passed contains no pattern + characters, we simply check that the given name exists in the + filesystem. (Notice that we use Haskell's function guard syntax + here to write a nice tidy definition. An if + would do, but isn't as aesthetically pleasing.) hunk ./en/ch09-globs-and-regexps.xml 839 - &Glob.hs:mundane; + &Glob.hs:namesMatching; hunk ./en/ch09-globs-and-regexps.xml 841 - (You might have noticed that the function - doesNameExist that we use here isn't in the - list of functions we imported from System.Directory. We'll define it - shortly.) + The name doesNameExist refers + to a function that we will define shortly. hunk ./en/ch09-globs-and-regexps.xml 845 - pattern? + pattern? Our function definition continues. hunk ./en/ch09-globs-and-regexps.xml 847 - &Glob.hs:otherwise; + &Glob.hs:namesMatching2; hunk ./en/ch09-globs-and-regexps.xml 849 - We use splitFileName to split the - string into a pair of everything but the final + We use splitFileName to split + the string into a pair of everything but the final hunk ./en/ch09-globs-and-regexps.xml 853 - directory. - - &Glob.hs:curdir; - - Otherwise, we must check the directory name and see if it - contains patterns. If it does not, we create a singleton list - of theb directory name. If it contains a pattern, we list all - of the matching directories. - - &Glob.hs:pats; - - Point out the use of return above. - This will be one of the first times the reader will have - encountered it. Even if we've already discussed this topic - earlier, we should repeat ourselves. + directory. Otherwise, we must check the directory name and see + if it contains patterns. If it does not, we create a singleton + list of the directory name. If it contains a pattern, we list + all of the matching directories. hunk ./en/ch09-globs-and-regexps.xml 859 - Using the The If we didn't remember (or know enough) to remove that - slash, we'd recurse endlessly in that call to - namesMatching above, because of the - following behaviour of - splitFileName. + If we didn't remember (or know enough) to remove + that slash, we'd recurse endlessly in + namesMatching, because of the following + behaviour of splitFileName. hunk ./en/ch09-globs-and-regexps.xml 874 - You can guess why this note got to be written! + You can guess what happened to us that led us to + add this note! hunk ./en/ch09-globs-and-regexps.xml 878 - Finally, we collect all matches in every directory, giving - us a list of lists, and concatenate them into a single list of - names. - - &Glob.hs:glue; + Finally, we collect all matches in every + directory, giving us a list of lists, and concatenate them into + a single list of names. hunk ./en/ch09-globs-and-regexps.xml 882 - The unfamiliar forM function above acts - a little like a for loop: it maps its second - argument (an action) over its first (a list), and returns the - resulting list. + The unfamiliar forM function + above acts a little like a for loop: it maps its + second argument (an action) over its first (a list), and returns + the list of results. hunk ./en/ch09-globs-and-regexps.xml 887 - We have a few loose ends to clean up. The first is the - definition of the doesNameExist function, - used above. The System.Directory doesn't let us - check to see if a name exists in the filesystem. It forces us - to think about whether we want to check for the existence of a - file or a directory, even if we don't care what kind of thing - the name is. This ungainly API is brought upon us by Windows. - We react by rolling the two checks into a single function, so - that we can ignore this portability quirk. In the name of - performance, we make the check for a file first, since files are - far more common than directories. + We have a few loose ends to clean up. The first + is the definition of the doesNameExist + function, used above. The System.Directory module doesn't let + us check to see if a name exists in the filesystem. It forces + us to decide whether we want to check for a file or a directory. + This API is ungainly, so we roll the two checks into a single + function. In the name of performance, we make the check for a + file first, since files are far more common than + directories. hunk ./en/ch09-globs-and-regexps.xml 900 - We have two other functions to define, each of which returns - a list of names in a directory. The + We have two other functions to define, each of + which returns a list of names in a directory. The hunk ./en/ch09-globs-and-regexps.xml 903 - files matching the given glob pattern in a directory, while the - listPlain function returns either an empty - or singleton list, depending on whether the single name it's - passed exists. + files matching the given glob pattern in a directory. hunk ./en/ch09-globs-and-regexps.xml 905 - &Glob.hs:listfuncs; + &Glob.hs:listMatches; + + The listPlain function returns either + an empty or singleton list, depending on whether the single name + it's passed exists. + + &Glob.hs:listPlain; hunk ./en/ch09-globs-and-regexps.xml 924 - This is telling us that handle takes - two arguments. The first is a function that is passed an - exception value, and can do I/O (we can see this because of the - IO type in its return value); this is the handler - to run if an exception is thrown. The second argument is the - code to run that might throw an exception. + This is telling us that + handle takes two arguments. The first is a + function that is passed an exception value, and can have side + effects (see the IO type in its return value); this + is the handler to run if an exception is thrown. The second + argument is the code that might throw an exception. hunk ./en/ch09-globs-and-regexps.xml 937 - The const function takes two arguments; - it always returns its first argument, no matter what its second - argument is. + The const function takes two + arguments; it always returns its first argument, no matter what + its second argument is. hunk ./en/ch09-globs-and-regexps.xml 943 - We won't have anything more to say about exception handling - here. There's plenty more to cover, though, so we'll be - returning to the subject of exceptions in chapter - FIXME. + We use const to write an exception + handler that ignores the exception it is passed. Instead, it + causes our code to return an empty list if we catch an + exception. + + We won't have anything more to say about exception + handling here. There's plenty more to cover, though, so we'll + be returning to the subject of exceptions in chapter . hunk ./en/ch09-globs-and-regexps.xml 958 - Although we've gone to some lengths to write a - portable namesMatching function, - the function uses our case sensitive + Although we've gone to some lengths to + write a portable namesMatching + function, the function uses our case sensitive hunk ./en/ch09-globs-and-regexps.xml 967 - Hint: consider reading the - documentation for Hint: consider + reading the documentation for If you're on a Unix-like system, look through the - documentation for the If you're on a Unix-like system, look + through the documentation for the The * wild card only matches names - within a single directory. Many shells have an extended - wild card syntax, **, that matches - names recursively in all directories. For example, - **.c would mean match a name - ending in .c in this directory or - any subdirectory at any depth. Implement - matching on ** wildcards. + The * wild card only + matches names within a single directory. Many shells + have an extended wild card syntax, + **, that matches names recursively in + all directories. For example, **.c + would mean match a name ending in + .c in this directory or any + subdirectory at any depth. Implement matching + on ** wildcards. hunk ./en/ch09-globs-and-regexps.xml 1062 - the possibility of a bad pattern, and make it call your + the possibility of a bad pattern, and make it use your hunk ./en/ch09-globs-and-regexps.xml 1065 + + + You may find the amount of work involved to be + surprisingly large. Don't worry; we will introduce + more concise and sophisticated ways of dealing with + errors in later chapters. + hunk ./en/ch09-globs-and-regexps.xml 1093 - Once again, we work around the ungainly file/directory split - that portability has forced upon Once again, we work around the ungainly + file/directory split in + function. hunk ./en/ch09-globs-and-regexps.xml 1104 - namesMatchingfunctions, so that we can + namesMatching functions, so that we can hunk ./en/ch09-globs-and-regexps.xml 1111 - The cc2cpp function uses a few - functions we'll be seeing over and over. The - mapM function maps a function that can do - I/O over a list. The flip function takes - another function as argument, and swaps the order of its - arguments (inspect the type of - replaceExtension in &ghci; to see why). + The cc2cpp function uses a + few functions we'll be seeing over and over. The + flip function takes another function as + argument, and swaps the order of its arguments (inspect the type + of replaceExtension in &ghci; to see why). hunk ./en/ch09-globs-and-regexps.xml 1117 - its right hand side as an argument to its left. + the action on its right side to the action on its left. hunk ./en/ch09-globs-and-regexps.xml 1132 - - - Wrapping it up - - Write some kind of intelligible conclusion. - hunk ./examples/ch09/Glob.hs 22 -{-- snippet mundane --} +{-- snippet namesMatching --} hunk ./examples/ch09/Glob.hs 30 -{-- /snippet mundane --} -{-- snippet otherwise --} +{-- /snippet namesMatching --} + +{-- snippet namesMatching2 --} hunk ./examples/ch09/Glob.hs 34 -{-- /snippet otherwise --} -{-- snippet curdir --} hunk ./examples/ch09/Glob.hs 38 -{-- /snippet curdir --} -{-- snippet pats --} hunk ./examples/ch09/Glob.hs 42 -{-- /snippet pats --} -{-- snippet glue --} hunk ./examples/ch09/Glob.hs 49 -{-- /snippet glue --} +{-- /snippet namesMatching2 --} hunk ./examples/ch09/Glob.hs 51 -{-- snippet listfuncs --} +{-- snippet listMatches --} hunk ./examples/ch09/Glob.hs 64 +isHidden ('.':_) = True +isHidden _ = False +{-- /snippet listMatches --} + +{-- snippet listPlain --} hunk ./examples/ch09/Glob.hs 75 -{-- /snippet listfuncs --} - -{-- snippet isHidden --} -isHidden :: String -> Bool -isHidden ('.':_) = True -isHidden _ = False -{-- /snippet isHidden --} +{-- /snippet listPlain --} hunk ./examples/ch09/GlobRegex.hs 1 +{-- snippet type --} +-- file: GlobRegex.hs + hunk ./examples/ch09/GlobRegex.hs 12 -{-- snippet type --} hunk ./examples/ch09/GlobRegex.hs 16 -globToRegex cs = '^' : globToRegex' cs +globToRegex cs = '^' : globToRegex' cs ++ "$" hunk ./examples/ch09/GlobRegex.hs 21 -globToRegex' "" = "$" +globToRegex' "" = "" + hunk ./examples/ch09/GlobRegex.hs 24 -{-- /snippet asterisk --} hunk ./examples/ch09/GlobRegex.hs 25 -{-- snippet question --} hunk ./examples/ch09/GlobRegex.hs 26 -{-- /snippet question --} hunk ./examples/ch09/GlobRegex.hs 27 -{-- snippet class --} hunk ./examples/ch09/GlobRegex.hs 28 -globToRegex' ('[':c:cs) = '[' : c : charClass cs -globToRegex' ('[':_) = error "unterminated character class" -{-- /snippet class --} +globToRegex' ('[':c:cs) = '[' : c : charClass cs +globToRegex' ('[':_) = error "unterminated character class" + +globToRegex' (c:cs) = escape c ++ globToRegex' cs +{-- /snippet asterisk --} + +{- hunk ./examples/ch09/GlobRegex.hs 38 +-} hunk ./examples/ch09/GlobRegex.hs 42 -escape c - | c `elem` regexChars = '\\' : [c] - | otherwise = [c] +escape c | c `elem` regexChars = '\\' : [c] + | otherwise = [c] hunk ./examples/ch09/GlobRegex.hs 50 -charClass (c:cs) = c: charClass cs -charClass [] = error "unterminated character class" +charClass (c:cs) = c : charClass cs +charClass [] = error "unterminated character class" hunk ./examples/ch09/HighestClose.hs 1 +{-- snippet closing --} +-- file: HighestClose.hs + +import qualified Data.ByteString.Lazy.Char8 as L + +closing = readPrice . (!!4) . L.split ',' +{-- /snippet closing --} + +{-- snippet readPrice --} +readPrice :: L.ByteString -> Maybe Int +readPrice str = + case L.readInt str of + Nothing -> Nothing + Just (dollars,rest) -> + case L.readInt (L.tail rest) of + Nothing -> Nothing + Just (cents,more) -> + Just (dollars * 100 + cents) +{-- /snippet readPrice --} + +{-- snippet highestClose --} +highestClose = maximum . (Nothing:) . map closing . L.lines + +highestCloseFrom path = do + contents <- L.readFile path + print (highestClose contents) +{-- /snippet highestClose --} hunk ./examples/ch09/SumFile.hs 1 +{-- snippet main --} +main = do + contents <- getContents + print (sumFile contents) + where sumFile = sum . map read . words +{-- /snippet main --} hunk ./examples/ch09/Useful.hs 26 -cc2cpp name = - mapM (renameWith (flip replaceExtension ".cpp")) =<< namesMatching "*.cc" +cc2cpp = + mapM (renameWith (flip replaceExtension ".cpp")) =<< namesMatching "*.cc" hunk ./examples/ch09/append.hs 4 -(++) :: [a] -> [a] -> [a] -[] ++ ys = ys +(++) :: [a] -> [a] -> [a] + hunk ./examples/ch09/append.hs 7 +[] ++ ys = ys hunk ./examples/ch09/filepath.ghci 1 +--# join +:m +System.FilePath +"foo" "bar" + +--# dropTrailingPathSeparator +dropTrailingPathSeparator "foo/" + +--# splitFileName +splitFileName "foo/bar/Quux.hs" +splitFileName "zippity" hunk ./examples/ch09/glob-regexp.ghci 26 -let fnmatch pat name = name =~ globToRegex pat :: Bool +let fnmatch pat name = name =~ globToRegex pat :: Bool hunk ./examples/ch09/glob-regexp.ghci 29 -fnmatch "myname" "d*" +fnmatch "d*" "myname" hunk ./examples/ch09/prices.csv 1 +Date,Open,High,Low,Close,Volume,Adj Close +2008-08-01,20.09,20.12,19.53,19.80,19777000,19.80 +2008-06-30,21.12,21.20,20.60,20.66,17173500,20.66 +2008-05-30,27.07,27.10,26.63,26.76,17754100,26.76 +2008-04-30,27.17,27.78,26.76,27.41,30597400,27.41 }