[More progress on glob to regexp conversion. Bryan O'Sullivan **20070611063432] { move ./examples/ch07/glob-to-regexp/GlobWithRegex.hs ./examples/ch07/glob-to-regexp/GlobRegex.hs hunk ./en/00book.xml 22 - - + + + + + + + + + hunk ./en/00book.xml 35 + hunk ./en/Makefile 10 - ch07/glob-to-regexp/GlobWithRegex.hs \ + ch07/glob-to-regexp/GlobRegex.hs \ + ch07/glob-regexp.ghci \ hunk ./en/ch07-globs-and-regexps.xml 293 - Other regular expression packages + Other things you should know hunk ./en/ch07-globs-and-regexps.xml 320 - match of foo, while a POSIX engine will - match the entire string (the second branch of the - alternative in the regexp). + match of foo (the leftmost match), while + a POSIX engine will match the entire string (the greediest + match). hunk ./en/ch07-globs-and-regexps.xml 341 + + Translating a glob pattern into a regular expression + + Now that we've seen the myriad of ways to match text against + regular expressions, let's turn our attention back to glob + patterns. We want to write a function that will take a glob + pattern and return its representation as a regular expression. + Both glob patterns and regexps are text strings, so the type + that our function ought to have seems clear. + + &GlobRegex.hs:type; + + We start our definition of the + globToRegex function by recalling that a + text string must match a glob pattern must match. Thus before + we attempt to convert any part of the glob pattern, we need to + have a rooted regular expression. + + &GlobRegex.hs:rooted; + + Recall that the String is just a synonym for + [Char], a list of Chars. The + : operator puts a value (the + ^ character in this case) onto the front of a + list, where the list is the value returned by the yet-to-be-seen + globToRegex' function. + + + It's perfectly normal for a Haskell function to refer to + functions or values that follow, rather than precede, it in a + source file. The Haskell compiler doesn't care about ordering + at this level. This grants you the flexibility to structure + your code in the manner that makes most logical sense to you, + as opposed to a way that makes the compiler writer's life + easiest. + + + With the regular expression rooted, the + globToRegex' function will do the bulk of + the translation work. We'll use the convenience of Haskell's + pattern matching to enumerate each of the cases we'll need to + cover. + + &GlobRegex.hs:asterisk; + + We now have a very minimal glob translator. Our first + clause requires that if we hit the end of our glob pattern, we + return $, the regular expression symbol for + match end-of-line. The second gets us to + substitute the string .* every time we see a + * in our input list. And the third passes + every other character through, unaltered. + + We can immediately save our new source file (let's call it + GlobRegexTiny.hs and start playing with it + in ghci. This is a great way to do + exploratory programming: write a little code, load it into the + + interpreter, and see what happens. + + &glob-regexp.ghci:tiny; + + These results might seem trivial, but we have received + immediate feedback on two fronts: our code passes the scrutiny + of the typechecker, and it produces sensible-looking results. + Monkeying around with code in the interpreter early and often is + consistently worthwhile. + + The remaining cases for a complete definition of + globToRegex' are easy to enumerate. First, + we get ? out of the way. + + &GlobRegex.hs:question; + + More interesting is how we handle character classes. + + &GlobRegex.hs:class; + + We take advantage of two behaviours of Haskell's pattern + matcher here. The first is that it will match patterns in the + order in which we declare them, so we place the most specific + patterns first, and the least specific last. Secondly, there's + no limit to the depth of a pattern: we can + peek forwards into a list as far as we need + to. + + Here, the first pattern matches on three consecutive items + of its input to ensure that a negative character + class cannot be empty. The second pattern matches on only two + items, so ensure that a positive character class + cannot be empty. The final clause can only match if it's given + a syntactically invalid character class. + + Finally, all other characters can be passed through, but + they may need to be escaped, so that the regexp engine won't + treat them specially. + + &GlobRegex.hs:rest; + + The charClass helper function does + nothing more than ensure that a character class is correctly + terminated. It passes its input through unmolested until it + hits a ], when it hands control back to + globToRegex'. + + &GlobRegex.hs:charClass; + + addfile ./examples/ch07/glob-regexp.ghci hunk ./examples/ch07/glob-regexp.ghci 1 +--# tiny + +:load glob-to-regexp/GlobRegexTiny.hs + +globToRegex "*" + +globToRegex "" + +globToRegex "**" hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 2 -module GlobWithRegex +module GlobRegex hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 14 +{-- snippet type --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 16 +{-- /snippet type --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 18 +{-- snippet rooted --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 20 +{-- /snippet rooted --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 22 +{-- snippet asterisk --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 24 - -globToRegex' ('*':cs) = ".*" ++ globToRegex' cs -globToRegex' ('?':cs) = '.' : globToRegex' cs -globToRegex' ('[':'!':cs) = charClass "^[" cs -globToRegex' ('[':cs) = charClass "[" cs -globToRegex' (c:cs) - | c == escapeChar = let (c', cs') = splitAt 1 cs - in '\\' : c' ++ globToRegex' cs' - | c `elem` "+()^$.{}]|" = '\\' : c : globToRegex' cs - | otherwise = c : globToRegex' cs hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 25 +globToRegex' ('*':cs) = ".*" ++ globToRegex' cs +{-- /snippet asterisk --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 28 -escapeChar :: Char +{-- snippet question --} +globToRegex' ('?':cs) = '.' : globToRegex' cs +{-- /snippet question --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 32 -escapeChar | pathSeparator == '/' = '\\' - | pathSeparator == '\\' = '`' - | otherwise = error "unknown platform in use" +{-- snippet class --} +globToRegex' ('[':'!':c:cs) = "[^" ++ c : charClass cs +globToRegex' ('[':c:cs) = '[' : c : charClass cs +globToRegex' ('[':_) = error "unterminated character class" +{-- /snippet class --} +{-- snippet rest --} +globToRegex' (c:cs) = escape c ++ globToRegex' cs hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 40 -charClass :: String -> String -> String +escape :: Char -> String +escape c + | c `elem` regexChars = '\\' : [c] + | otherwise = [c] + where regexChars = "\\+()^$.{}]|" +{-- /snippet rest --} hunk ./examples/ch07/glob-to-regexp/GlobRegex.hs 47 -charClass acc (']':cs) = reverse (']':acc) ++ globToRegex' cs -charClass acc (c:cs) = charClass (c:acc) cs -charClass acc [] = error "unterminated character class" +{-- snippet charClass --} +charClass :: String -> String +charClass (']':cs) = ']' : globToRegex' cs +charClass (c:cs) = c: charClass cs +charClass [] = error "unterminated character class" +{-- /snippet charClass --} addfile ./examples/ch07/glob-to-regexp/GlobRegexTiny.hs hunk ./examples/ch07/glob-to-regexp/GlobRegexTiny.hs 1 +globToRegex :: String -> String + +globToRegex cs = '^' : globToRegex' cs + +globToRegex' :: String -> String + +globToRegex' "" = "$" +globToRegex' ('*':cs) = ".*" ++ globToRegex' cs }