[Write up a big honkin' hunk of regexp material. Bryan O'Sullivan **20070611003617] { hunk ./en/00book.xml 17 - presented in the diff output. --> + presented in the diff output. hunk ./en/00book.xml 19 - - + Please keep this section sorted lexicographically, otherwise it + gets unwieldy fast. --> + + + merger 0.0 ( hunk ./en/00book.xml 27 + hunk ./en/00book.xml 27 - - + + + + + + + + + + + + + + + + + + ) hunk ./en/ch07-globs-and-regexps.xml 110 - languages. + languages. Haskell's regular expression matching libraries are + a lot more expressive than those of other languages, so there's + plenty to talk about. + + To begin our use of the regexp libraries, the only module + we'll need to work with is Text.Regex.Posix. As usual, the + most convenient way to explore this module is by interacting + with it via ghci. + + ®exp.ghci:load; + + The only function that + we're likely to need for normal use is the regexp matching + function, an infix operator named =~ + (borrowed from Perl. The first hurdle to overcome is that + Haskell's regexp libraries make heavy use of polymorphism. As a + result, the type signature of the =~ + operator is a bit difficult to understand. Let's drop into + ghci and see what it looks like. + + ®exp.ghci:typetwiddle; + + Ouch! Rather than pick apart what this convoluted + definition means, let's take a shortcut. The + =~ operator uses typeclasses for both of + its arguments, and also for its return type. The first argument + (on the left of the =~) is the text to + match; the second (on the right) is the regular expression to + match against. We can pass either a String or a + Data.ByteString as either argument. + + + The many types of result + + The =~ operator is polymorphic in its + return type, so the Haskell compiler needs some way to know + what type of result we would like. In real code, it may be + able to infer the right type, due to the way we subsequently + use the result. But such cues are often lacking when we're + exploring with ghci. If we omit a specific + type for the result, we'll get an intimidating-looking error + message. Here's what ghci tells us if we + try to do a regexp match but omit the return type. + + ®exp.ghci:noreturn; + + That's quite an impenetrable error message, but what it + means is that ghci can't infer what type of + result to give back to us. (This is the same kind of error + that a compiler would print if it couldn't infer the correct + type in a real program.) Here's the key to dealing with this + error. When ghci can't infer the target type, we tell it what we'd + like the type to be. If we want an expression of type + Bool, we'll get a pass/fail result. + + ®exp.ghci:bool.passfail; + + In the bowels of the base regexp library, there's a + typeclass named RegexContext that + describes how a target type should behave; the + base library defines many instances of this typeclass for us. + The Bool type is an instance of this typeclass, + so we get back a usable result. Another such instance is + Int, which gives us a count of the number of + times the regexp matches. + + ®exp.ghci:int.count; + + If we ask for a String result, we'll get the + first substring that matches, or an empty string if nothing + matches. + + ®exp.ghci:string; + + + Getting back an empty string for no match + poses an obvious difficulty if the empty string could be a + valid match for the regexp. + + FIXME: Find some place to mention the monadic + =~~ operator. + + + Another valid type of result is [String], + which returns a list of all matching + strings. + + ®exp.ghci:list.string; + + That's about it for simple result types, + but we're not by any means finished. Before we continue, + let's use a single pattern for our remaining examples. + + ®exp.ghci:let; + + We can obtain quite a lot of information about the + context in which a match occurs. If we ask for a + three-element tuple, we'll get back the text + before the first match, the text + of that match, and the text that + follows it. + + ®exp.ghci:3tuple.match; + + If the match fails, the entire text is returned as the + before element of the tuple, with the other two + elements left empty. + + ®exp.ghci:3tuple.nomatch; + + Asking for a four-element tuple gives us a fourth element + that's a list of all groups in the pattern that + matched. + + ®exp.ghci:4tuple; + + We can get numeric information about matches, + too. A pair of Ints gives us the starting offset + of the first match, and its length. If we ask for a list of + these pairs, we'll get this information for all + matches. + + ®exp.ghci:2tuple.match; + + A failed match is represented by the value + -1 as the first element of the tuple (the + match offset), or an empty list of tuples. + + ®exp.ghci:2tuple.nomatch; + + Believe it or not, this is not a + comprehensive list of built-in instances of the + RegexContext typeclass. For a complete + list, see the documentation for the Text.Regex.Base.Context + module. + + + + + Mixing and matching string types + + As we noted earlier, the =~ operator + uses typeclasses for its argument types and its return type. + Any combination of String and + ByteString will work for the types of both the + regular expression and the text to match against. + + ®exp.ghci:mix.simple; + + However, be aware that if you want a string + value in the result of a match, the text you're matching + against must be the same type of string. Let's see what this + means in practice. + + ®exp.ghci:mix.match; + + In the above example, we've used the + pack to turn a String into a + ByteString. The typechecker accepts this because + ByteString appears in the result type. But if we + try getting a String out, that + won't work. + + ®exp.ghci:mix.nomatch; + + And ouch! That's quite an error message. Don't let its + verbosity confound you; we can easily fix this problem by + making the string types of the left hand side and the result + match once again. + + ®exp.ghci:mix.rematch; + + This restriction does not apply to + the type of the regexp to match against. It can be either a + String or ByteString, unconstrained + by the other types in use. + + + + Other regular expression packages + + When you look through Haskell library documentation, + you'll see several regexp-related modules. The modules under + Text.Regex.Base define + the common API adhered to by all of the other regexp modules. + It's possible to have multiple implementations of the regexp + API installed at one time. At the time of writing, + GHC is bundled with one + implementation, Text.Regex.Posix. As its name + suggests, this package provides POSIX regexp semantics. + + + If you're coming to Haskell from a language like Perl, + Python, or Java, and you've used regular expressions in one + of those languages, you should be aware that the POSIX + regexps handled by the Text.Regex.Posix module are + different in some significant ways from Perl-style + regexps. Here are a few of the more notable differences. + + Perl regexp engines do left-biased matching when + matching alternatives, whereas POSIX engines choose the + greediest match. What this means is that given a regexp of + (foo|fo*) and a text string of + foooooo, a Perl-style engine will give a + match of foo, while a POSIX engine will + match the entire string (the second branch of the + alternative in the regexp). + + POSIX regexps have less uniform syntax + than Perl-style regexps. They also lack a number of + capabilities provided by Perl-style regexps, such as + zero-width assertions and control over greedy + matching. + + + Other Haskell regexp packages are available for download + on the Internet. Some provide faster engines than the current + POSIX engine; others provide Perl-style matching capabilities. + All follow the standard API that we have covered in this + section. + + hunk ./en/ch07-globs-and-regexps.xml 339 - The first hurdle to overcome is that Haskell's regexp - library makes heavy use of polymorphism; as a result, many of - its type signatures are difficult to understand. -®exp.ghci:typetwiddle; hunk ./en/ch07-globs-and-regexps.xml 340 + hunk ./examples/ch07/regexp.ghci 1 ---# typetwiddle +--# load + hunk ./examples/ch07/regexp.ghci 4 + +--# typetwiddle + hunk ./examples/ch07/regexp.ghci 9 +--# noreturn + +"my left foot" =~ "foo" + +--# bool.passfail + +"my left foot" =~ "foo" :: Bool + +"your right hand" =~ "bar" :: Bool + +"your right hand" =~ "(hand|foot)" :: Bool + +--# int.count + +"a star called henry" =~ "planet" :: Int + +"honorificabilitudinitatibus" =~ "[aeiou]" :: Int + +--# string + +"I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: String + +"hi ludi, F. Baconis nati, tuiti orbi" =~ "Shakespeare" :: String + +--# list.string + +"I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: [String] + +"hi ludi, F. Baconis nati, tuiti orbi" =~ "Shakespeare" :: [String] + +--# let + +let pat = "(foo[a-z]*bar|quux)" + +--# 3tuple.match + +"before foodiebar after" =~ pat :: (String,String,String) + +--# 3tuple.nomatch + +"no match here" =~ pat :: (String,String,String) + +--# 4tuple + +"before foodiebar after" =~ pat :: (String,String,String,[String]) + +--# 2tuple.match + +"before foodiebar after" =~ pat :: (Int,Int) +"before foodiebar after" =~ pat :: [(Int,Int)] + +--# 2tuple.nomatch + +"eleemosynary" =~ pat :: (Int,Int) +"mondegreen" =~ pat :: [(Int,Int)] + +--# mix.simple + +:module +Data.ByteString.Char8 + +:type pack "foo" + +pack "foo" =~ "bar" :: Bool +"foo" =~ pack "bar" :: Int +pack "foo" =~ pack "o" :: [(Int, Int)] + +--# mix.match + +pack "good food" =~ ".ood" :: [ByteString] + +--# mix.nomatch + +"good food" =~ ".ood" :: [ByteString] + +--# mix.rematch + +"good food" =~ ".ood" :: [String] + }