[Progress on chapter 10. Bryan O'Sullivan **20071104214520] { move ./en/ch10-barcode.xml ./en/ch10-binary.xml adddir ./examples/ch10 hunk ./en/00book.xml 20 - + hunk ./en/book-shortcuts.xml 99 +newtype"> hunk ./en/ch10-binary.xml 3 - - FIXME - FIXME. + + Code case study: parsing a binary file format + + In this chapter, we'll discuss a common task: parsing a binary + file. We're going to use this for two purposes. Our first is to + talk a little about parsing, but our main goal is to talk about + program organisation and boilerplate + removal. + + As our task, we'll choose parsing a few different netpbm file + types. These file formats have the dual advantages of wide use + and being fairly easy, but not completely trivial, to parse. Most + importantly for our convenienve, netpbm files are not + compressed. + + + Greyscale files + + The name of netpbm's greyscale file format is PGM + (portable grey map). It is actually not one + format, but two; the plain (or P2) + format is encoded as ASCII, while the more common + raw (P5) format is mostly + binary. + + A file of either format starts with a header, which in turn + begins with a magic string describing the format. + For a plain file, the string is P5, and for + raw, it's P2. The magic string is followed by + white space, then by three numbers: the width, height, and + maximum grey value of the image. These numbers are represented + as ASCII decimal numbers, separated by white space. + + After the maximum grey value comes the image data. In a raw + file, this is a string of binary values. In a plain file, the + values are represented as ASCII decimal numbers separated by + white space. + + A raw file can contain a sequence of images, one after the + other, each with its own header. A plain file contains only one + image. + + + + + Parsing a raw PGM file + + For our first try at a parsing function, we'll only worry + about raw PGM files. We'll write our PGM parser as a + pure function. It's not responsible for + obtaining the data to parse, just for the actual parsing. This + is a common approach in Haskell programs. By separating the + reading of the data from what we subsequently do with it, we + gain flexibility in where we take the data from. + + We'll use the ByteString type to store our + greymap data, because it's compact. + + &PNM.hs:imports; + + For our purposes, it doesn't matter whether we use a lazy or + strict ByteString, so we've somewhat arbitrarily + chosen the lazy kind. + + The ByteString module contains many definitions + that have the same names as existing Prelude definitions that + are automatically imported for us. Because of this, if we try + to use a name that is present in both the + ByteString module and the Prelude, the compiler + will complain about ambiguity. We avoid this problem by + importing the module under an alias, L: every time + you see a name prefixed with L., we're using the + name from ByteString. + + We'll use a straightforward data type to represent PGM + files. + + &PNM.hs:Greymap; + + Normally, a Haskell Show instance should + produce a string representation that we can read back by calling + read. However, for a bitmap graphics file, + this would potentially produce huge text strings, for example if + we were to show a photo. For this reason, + we're not going to let the compiler automatically derive a + Show instance for us: we'll write our own, + intentionally less capable, implementation. + + &PNM.hs:Show; + + Because our Show instance intentionally avoids + printing the bitmap data, there's no point in writing a + Read instance, as we can't reconstruct a valid + Greymap from the result of + show. + + Here's an obvious type for our parsing function. + + &PNM.hs:parseP5.type; + + This will take a ByteString, and if the parse + succeeds, it will return the parsed Greymap, along + with the string that remains after parsing. + + Our parsing function has to consume a little bit of its + input at a time. First, we need to assure ourselves that we're + really looking at a raw PGM file; then we need to parse the + numbers from the remainder of the header; then we consume the + bitmap data. Here's an obvious way to express this. + + &PNM.hs:parseP5; + + This is a very direct piece of code. Each function that it + calls returns the residual ByteString left over + after it has finished. It deconstructs each result in turn, + either failing if the function failed, or building up a piece of + the result as it continues. The bodies of the functions that it + calls aren't especially interesting. + + &PNM.hs:parseP5.functions; + + + + + Getting rid of boilerplate code + + Our parseP5 function is somehow + unsatisfying. It marches steadily to the right of the screen, + so it's clear that a slightly more complicated function would + run out of space. And it repeats a pattern of + constructing and then deconstructing Maybe values, + only continuing if a particular value matches + Just. All of the similar &case; + expressions act as boilerplate code, lots of + busywork that obscures what we're really trying to do. In + short, this function is begging for a little abstraction. + + If we step back a little, we can see two patterns. First is + that the functions that we're calling have similar types. Each + takes a ByteString as its last argument, and + returns Maybe something else. Secondly, every step + in the ladder of our parseP5 + function deconstructs a Maybe value, and either + fails or passes the unwrapped result to a function. + + We can quite easily write a function that captures this + second pattern. + + &PNM.hs:then; + + The (>>>) function acts very + simply: it takes a value as its left argument, and a function as + its right. If the value is not Nothing, it + calls the function. We have written this function as an + operator so that we can use it to chain functions + together. + + With this chaining function in hand, we can take a second + try at our parsing function. + + &PNM.hs:parseP5_take2; + + The key to understanding this function is to think about the + chaining. On the left hand side of each + (>>>) is a Maybe + value; on the right is a function that returns a + Maybe value. Each expression is thus of type + Maybe, suitable for passing to the following + (>>>) expression. + + The other change that we've made to improve readability is + add a skipSpace function. With these + changes, we've exactly halved the number of lines of code + compared to our original parsing function. By removing the + boilerplate &case; expressions, we've made the code easier to + follow. + + However, we're not yet out of the woods. This code + explicitly passes two-tuples around, using one element for an + intermediate part of the parsed result and the other for the + current residual ByteString. If we want to extend + the code, for example to track the number of bytes we've + consumed so that we can report the location of a parse failure, + we need to modify eight different locations so that we can pass + a three-tuple around. + + + + Implicit state + + While we've gotten rid of some boilerplate code, the + two-tuple that we use to pass around our partial result and + residual string is a serious problem: it makes our code + difficult to change. + + We can do something about this, though. First, let's + augment the state that our parser uses. + + &Parse.hs:ParseState; + + We can now track both the current residual string and the + offset into the original string since we started parsing. This + lets us think of parsing as a function from one + ParseState to another, also returning the result of + the parse. + + &Parse.hs:Parse; + + The &newtype; declaration for the Parse type + just acts as a safety wrapper around this function type. It + allows us to ensure that we can't accidentally run a + parser. + + Let's try to define a minimal parser. + + &Parse.hs:identity; + + All this function has to do is take a parse state, leave it + untouched, and use the function's argument as the result of the + parse. We wrap this function in our Parse type, + but how can we actually use this wrapped function to parse + something? + + The first thing we must do is peel off the + Parse wrapper so that we can get at the + function inside. We do this by calling + runParse. We also need to construct a + ParseState, then run our parsing function on that + parse state. Finally, we'd like to extract the result of the + parse from the final ParseState. + + &Parse.hs:parse; + + Because neither the identity parser nor + the parse function examine the parse state + at all, we don't even need to bother creating an input string in + order to try this out. + + &parse.ghci:parse; + + A parser that doesn't even inspect its input isn't very + interesting, but at least we have confidence that our types are + correct. Let's focus now on writing a parser that does + something meaningful. We're not going to get very ambitious + yet, though: all we want to do is parse a single byte. + + &Parse.hs:parseByte; + + There's some unfamiliar code in use here, so let's take a + deeper look. The (==>) function serves + a similar purpose as our earlier + (>>>) function, acting as + glue to let us chain functions together. + + &Parse.hs:then; + + Indeed, the types of the two functions are very similar. The + body of (==>) is interesting, and ever + so slightly tricky. Remember that Parse is really + a function type with a wrapper. Therefore, + (==>) must return a function, in a + wrapper. It doesn't really do much: it just + creates a closure to remember the values of + x and f. This closure + won't be unwrapped and called until we call + parse. At that point, it will be called + with a ParseState. It will call the + Parse that is its left argument and inspect its + result. If that parse failed, the closure fails too. Otherwise, + it passes the result of the parse and the new + ParseState to f. + + This is really quite fancy and subtle stuff: we're + effectively passing the ParseState down the chain + of Parse values in a hidden argument. Our + parseByte function doesn't take the parse + state as an argument. Instead, it has to call + getState to get a copy of the state, and + putState to replace the current state with + a new one. + + &Parse.hs:getPut; + + When reading these functions, recall that the left element + of the tuple is the result of a Parse, while the + right is the current ParseState. This makes it + easier to follow what these functions are doing. + + + + + Introducing functors + + We're by now thoroughly familiar with the + map function, which applies a function to + every element of a list, returning a list of possibly a + different type. + + &functor.ghci:map; + + This map-like activity can be useful in + other instances. For example, consider a binary tree. + + &TreeMap.hs:Tree; + + If we want to take a tree of strings and turn it into a tree + containing the lengths of those strings, we could write a + function to do this. + + &TreeMap.hs:treeLengths; + + Now that our eyes are attuned to looking for patterns that + we can turn into generally useful functions, we can see a + possible case of this here. + + &TreeMap.hs:treeMap; + + As we might hope, treeLengths and + treeMap length give the same + results. + + &functor.ghci:treeLengths; + + Haskell provides a well-known typeclass to further + generalise treeMap. This typeclass is + named Functor, and it defines one function, + fmap. + + &TreeMap.hs:Functor; + + We can think of fmap as a kind of + lifting function, as we introduced in . It takes a function over + ordinary values a -> b and turns it into a + function over containers f a -> f b, where + f is the container type. + + If we substitute Tree for the type variable + f, for example, the type of + fmap is identical to the type of + treeMap, and in fact we can use + treeMap as the implementation of + fmap over Trees. + + &TreeMap.hs:Functor.Tree; + + We can also use map as the + implementation of fmap for lists. + + &TreeMap.hs:Functor.List; + + We can now use fmap over different + container types. + + &functor.ghci:fmap; + + The Prelude defines instances of Functor for + several common types. The instance for lists is provided in the + Prelude, as is the instance for Maybe. + + &TreeMap.hs:Functor.Maybe; + + The instance for Maybe makes it particularly + clear what an fmap implementation needs to + do. The implementation must have a sensible behaviour for each + of a type's constructors. If a value is wrapped in + Just, for example, the fmap + implementation calls the function on the unwrapped value, then + rewraps it in Just. + + The definition of Functor imposes a few obvious + restrictions on what we can do with fmap. + For example, we can only make instances of + Functor from types that have exactly one + free type variable. + + + What's a free type variable? + + A free type variable is a lower-case type variable, such + as a, that hasn't been bound to a + particular type. For example, the type Maybe a + has one free type variable, but Maybe Int has + none. We say that the type variable a is + bound to the type + Int. + + + We can't write an fmap implementation + for Either a b or (a, b), for example, + because these have two free type variables. We also can't write + work with Bool or Int, as they have no + free type variables. + + + Flexible instances + + You might hope that we could write a Functor + instance for the type Either Int b, which has one + free type variable. + + &EitherInt.hs:Functor; + + However, the type system of Haskell 98 cannot guarantee + that checking the constraints on such an instance will + terminate. A non-terminating constraint check will send a + compiler into an infinite loop, so instances of this form are + forbidden. + + &functor.ghci:EitherInt; + + &GHC; has a more powerful type system than the base + Haskell 98 standard. It operates in standard compatibility + mode by default, for maximal portability. We can instruct it + to allow more flexible instances using a special compiler + directive. + + &EitherIntFlexible.hs:Functor; + + The directive is embedded in the specially formatted + LANGUAGE comment. These directives are usually + referred to as pragmas. Pragmas are always + enclosed in the special comment sequences {-#, to + begin, and #-}, to end. + + &GHC; supports many kinds of pragma. Most pragmas only + have meaning at specific locations in a source file. Language + pragmas, for example, are only obeyed if they are present at + the beginning of a source file. + + With our Functor instance in hand, let's try + out fmap on Either + Int. + + &functor.ghci:EitherIntFlexible; + + + + + Thinking more about functors + + We've made a few implicit assumptions about how functors + ought to work. It's helpful to make these explicit and to + think of them as rules to follow, because this lets us treat + functors as uniform, well-behaved objects. + + Our first rule is that a functor must preserve + identity. That is, applying fmap + id to a value should give us back an identical + value. + + &functor.ghci:id; + + Our second rule is that a functor must preserve + shape. The structure of a collection + should not be affected by a functor; only the values that it + contains should change. + + &functor.ghci:shape; + + Finally, functors must be composable. + That is, composing two uses of fmap + should give the same result as one fmap + with the same functions composed. + + &functor.ghci:composition; + + If you're writing a Functor instance, it's + useful to keep these rules in mind, and indeed to test them. + Otherwise, the behaviour they specify is + natural enough that there's no need to memorise + them; they just formalise the notions of do what I + mean. + + hunk ./en/ch12-binary.xml 3 - + addfile ./examples/ch10/EitherInt.hs hunk ./examples/ch10/EitherInt.hs 1 +{-- snippet Functor --} +instance Functor (Either Int) where + fmap _ (Left n) = Left n + fmap f (Right r) = Right (f r) +{-- /snippet Functor --} addfile ./examples/ch10/EitherIntFlexible.hs hunk ./examples/ch10/EitherIntFlexible.hs 1 +{-- snippet Functor --} +{-# LANGUAGE FlexibleInstances #-} + +instance Functor (Either Int) where + fmap _ (Left n) = Left n + fmap f (Right r) = Right (f r) +{-- /snippet Functor --} addfile ./examples/ch10/PNM.hs hunk ./examples/ch10/PNM.hs 1 +module PNM + ( + Anymap(..) + , Greymap(..) + ) where + +{-- snippet imports --} +import qualified Data.ByteString.Lazy.Char8 as L +import Data.Char (isSpace) +{-- /snippet imports --} +import Debug.Trace + +{-- snippet Anymap --} +class Anymap a where + anyWidth :: a -> Int + anyHeight :: a -> Int + anyParse :: L.ByteString -> [a] + anyReadFile :: FilePath -> IO [a] +{-- /snippet Anymap --} + +{-- snippet Greymap --} +data Greymap = Greymap { + greyWidth :: Int + , greyHeight :: Int + , greyMax :: Int + , greyData :: L.ByteString + } deriving (Eq) +{-- /snippet Greymap --} + +{-- snippet Show --} +instance Show Greymap where + show (Greymap w h m _) = "Greymap " ++ show w ++ "x" ++ show h ++ + " " ++ show m +{-- /snippet Show --} + +{-- snippet parseP5.type --} +parseP5 :: L.ByteString -> Maybe (Greymap, L.ByteString) +{-- /snippet parseP5.type --} + +{-- snippet parseP5 --} +matchHeader :: L.ByteString -> L.ByteString -> Maybe L.ByteString + +-- "nat" here is short for "natural number", not "nathan torkington" +getNat :: L.ByteString -> Maybe (Int, L.ByteString) + +getBytes :: Int -> L.ByteString -> Maybe (L.ByteString, L.ByteString) + +parseP5 s = + case matchHeader (L.pack "P5") s of + Nothing -> Nothing + Just s1 -> + case getNat s1 of + Nothing -> Nothing + Just (width, s2) -> + case getNat (L.dropWhile isSpace s2) of + Nothing -> Nothing + Just (height, s3) -> + case getNat (L.dropWhile isSpace s3) of + Nothing -> Nothing + Just (maxGrey, s4) + | maxGrey > 255 -> Nothing + | otherwise -> + case getBytes 1 s4 of + Nothing -> Nothing + Just (_, s5) -> + case getBytes (width * height) s5 of + Nothing -> Nothing + Just (bitmap, s6) -> + Just (Greymap width height maxGrey bitmap, s6) +{-- /snippet parseP5 --} + +{-- snippet parseP5.functions --} +matchHeader h s + | h `L.isPrefixOf` s = Just (L.dropWhile isSpace (L.drop (L.length h) s)) + | otherwise = Nothing + +getNat s = case L.readInt s of + Nothing -> Nothing + Just (i, s') | i <= 0 -> Nothing + | otherwise -> Just (fromIntegral i, s') + +getBytes n s = let n' = fromIntegral n + ht@(h, t) = L.splitAt n' s + in if L.length h < n' + then Nothing + else Just ht +{-- /snippet parseP5.functions --} + +{-- snippet then --} +(>>>) :: Maybe a -> (a -> Maybe b) -> Maybe b +Nothing >>> _ = Nothing +Just v >>> f = f v +{-- /snippet then --} + +{-- snippet parseP5_take2 --} +parseP5_take2 :: L.ByteString -> Maybe (Greymap, L.ByteString) +parseP5_take2 s = + matchHeader (L.pack "P5") s >>> + \s -> skipSpace ((), s) >>> + (getNat . snd) >>> + skipSpace >>> + \(width, s) -> getNat s >>> + skipSpace >>> + \(height, s) -> getNat s >>> + \(maxGrey, s) -> getBytes 1 s >>> + (getBytes (width * height) . snd) >>> + \(bitmap, s) -> Just (Greymap width height maxGrey bitmap, s) + +skipSpace :: (a, L.ByteString) -> Maybe (a, L.ByteString) +skipSpace (a, s) = Just (a, L.dropWhile isSpace s) +{-- /snippet parseP5_take2 --} + +parseAllP5 :: L.ByteString -> [Greymap] +parseAllP5 s = case parseP5_take2 s of + Nothing -> [] + Just (g, s') -> g : parseAllP5 s' addfile ./examples/ch10/Parse.hs hunk ./examples/ch10/Parse.hs 1 +import qualified Data.ByteString.Lazy as L +import Data.ByteString.Lazy.Char8 (pack) +import Data.Char (chr, ord, isDigit) +import Data.Int (Int64) +import Data.Word (Word8) + +{-- snippet ParseState --} +data ParseState = ParseState { + string :: L.ByteString + , offset :: Int64 + } deriving (Show) +{-- /snippet ParseState --} + +{-- snippet Parse --} +newtype Parse a = Parse { + runParse :: ParseState -> Either String (a, ParseState) + } +{-- /snippet Parse --} + +{-- snippet identity --} +identity :: a -> Parse a +identity a = Parse (\s -> Right (a, s)) +{-- /snippet identity --} + +{-- snippet bail --} +bail :: String -> Parse a +bail err = Parse $ \s -> Left $ + "byte offset " ++ show (offset s) ++ ": " ++ err +{-- /snippet bail --} + +{-- snippet then --} +(==>) :: Parse a -> (a -> Parse b) -> Parse b +x ==> f = Parse (\st -> case runParse x st of + Left err -> Left err + Right (a, st') -> runParse (f a) st') +{-- /snippet then --} + +x ==>! f = x ==> \_ -> f + +{-- snippet parse --} +parse :: Parse a -> L.ByteString -> Either String a +parse f s = case runParse f (ParseState s 0) of + Left err -> Left err + Right (a, _) -> Right a +{-- /snippet parse --} + +instance Monad Parse where + return = identity + (>>=) = (==>) + fail = bail + +{-- snippet getPut --} +getState :: Parse ParseState +getState = Parse (\s -> Right (s, s)) + +putState :: ParseState -> Parse () +putState s = Parse (\_ -> Right ((), s)) +{-- /snippet getPut --} + +{-- snippet uncons --} +uncons :: L.ByteString -> Maybe (Word8, L.ByteString) +uncons s = if L.null s + then Nothing + else Just (L.head s, L.tail s) +{-- /snippet uncons --} + +{-- snippet parseByte --} +parseByte :: Parse Word8 +parseByte = + getState ==> \st -> + case uncons (string st) of + Nothing -> bail "no more input" + Just (c, s) -> let st' = st { string = s, offset = offset st + 1 } + in putState st' ==> \_ -> identity c +{-- /snippet parseByte --} + +peekByte :: Parse (Maybe Word8) +peekByte = getState ==> (identity . fmap fst . uncons . string) + +instance Functor Parse where + fmap f p = Parse (\st -> case runParse p st of + Left err -> Left err + Right (a, st') -> let fa = identity (f a) + in runParse fa st') + +parseEnd :: Parse Bool +parseEnd = (L.null . string) `fmap` getState + +w2c = chr . fromIntegral + +peekChar :: Parse (Maybe Char) +peekChar = fmap w2c `fmap` peekByte +parseChar :: Parse Char +parseChar = w2c `fmap` parseByte + +parseWhile :: (Char -> Bool) -> Parse [Char] +parseWhile p = peekChar ==> \mc -> + case mc of + Nothing -> identity [] + Just c | p c -> parseChar ==>! + parseWhile p ==> \cs -> + identity (c:cs) + | otherwise -> identity [] + +parseNat :: Parse Int +parseNat = parseWhile isDigit ==> \digits -> + if null digits + then bail "no more input" + else let n = read digits + in if n < 0 + then bail "integer overflow" + else identity n addfile ./examples/ch10/TreeMap.hs hunk ./examples/ch10/TreeMap.hs 1 +import Prelude hiding (Functor(..)) + +{-- snippet Tree --} +data Tree a = Node (Tree a) (Tree a) + | Leaf a + deriving (Show) +{-- /snippet Tree --} + +{-- snippet treeLengths --} +treeLengths (Leaf s) = Leaf (length s) +treeLengths (Node l r) = Node (treeLengths l) (treeLengths r) +{-- /snippet treeLengths --} + +{-- snippet treeMap --} +treeMap :: (a -> b) -> Tree a -> Tree b +treeMap f (Leaf a) = Leaf (f a) +treeMap f (Node l r) = Node (treeMap f l) (treeMap f r) +{-- /snippet treeMap --} + +{-- snippet Functor --} +class Functor f where + fmap :: (a -> b) -> f a -> f b +{-- /snippet Functor --} + +{-- snippet Functor.Tree --} +instance Functor Tree where + fmap = treeMap +{-- /snippet Functor.Tree --} + +{-- snippet Functor.List --} +instance Functor [] where + fmap = map +{-- /snippet Functor.List --} + +{-- snippet Functor.Maybe --} +instance Functor Maybe where + fmap _ Nothing = Nothing + fmap f (Just x) = Just (f x) +{-- /snippet Functor.Maybe --} addfile ./examples/ch10/functor.ghci hunk ./examples/ch10/functor.ghci 1 +:load TreeMap + +--# map + +map (+1) [1,2,3] +map show [1,2,3] +:type map show + +--# treeLengths + +let tree = Node (Leaf "foo") (Node (Leaf "x") (Leaf "quux")) +treeLengths tree +treeMap length tree +treeMap (odd . length) tree + +--# fmap + +fmap length ["foo","quux"] +fmap length (Tree (Leaf "livingstone") (Leaf "i presume")) + +--# EitherInt + +:load EitherInt + +--# EitherIntFlexible + +:load EitherIntFlexible +fmap (== "cheeseburger") (Left 1 :: Either Int String) +fmap (== "cheeseburger") (Right "fries" :: Either Int String) + +--# id + +fmap id (Tree (Leaf "a") (Leaf "b")) + +--# shape + +fmap odd (Just 1) +fmap odd Nothing + +--# composition + +(fmap even . fmap length) (Just "twelve") +fmap (even . length) (Just "twelve") addfile ./examples/ch10/parse.ghci hunk ./examples/ch10/parse.ghci 1 +--# parse +:load Parse +:type parse (identity 1) undefined }