[More parsec work John Goerzen **20070905064731] { hunk ./en/ch17-parsec.xml 37 - - - + Let's jump right in by writing some code for parsing a CSV file. + CSV files are often used as a plain text representation of spreadsheets + or databases. Each line is a record, and each field in the record is + separated from the next by a comma. There are ways of dealing with + fields that contain commas, but to start with, we won't worry about it. + + + This first example is much longer than it really needs to be. We will + introduce more Parsec features in a little bit that will shrink the + parser down to only four lines! + + &csv1.hs:all; + + Before looking at all this code in detail, + let's play with it a bit and see what + it does. + + &csv1.ghci:s1; + + That makes sense: parsing the empty string returns an empty list. + Let's try parsing a single cell. + + &csv1.ghci:s2; + + Look at that. Recall how we defined that each line must end with the + end-of-line character, and we didn't give it. Parsec's error message + helpfully indicated the line number and column number of the problem, + and even told us what it was expecting! Let's give it an end-of-line + character and continue experimenting. + + &csv1.ghci:s3; + + You can see that parseCSV is doing exactly what we + wanted it to do. It's even handling empty cells and empty lines + properly. + + + Let's take a look at the code for this example. We didn't use many + shortcuts here, so remember that this will get shorter and simpler! + + + We've built it from the + top down, + so our first function is csvFile. The type of this + function is GenParser Char st [[String]]. This + means that the type of the input is a sequence of characters, which is + exactly what a Haskell string is, since String is + the same as [Char]. It also means that we will + return a value of type [[String]]: a list of a list + of strings. The st can be ignored for now. + + + Parsec programmers usually omit type declarations, since we write so + many small functions. Haskell's type inference can figure it out. + We've listed the types for the first example here so you can get a + better idea of what's going on. You can always use + :t in &ghci; to inspect types as well. + + + The csvFile uses a &do; block. Parsec is a monadic + parser combinator, but we are not using the &IO; monad here. + As we discussed in fixme: insert ref, not all monads + are &IO; monads. + + + We start by running many line. + many is a function that takes a function as an + argument. It tries to repeatedly + parse the input using the function passed to it. It gathers up the + results from all that repeated parsing and returns a list of them. So, + here, we are storing the results of parsing all lines in + result. Then we look for the end-of-file indicator, + called eof. Finally, we return the + result. So, a CSV file is made of of many lines, + then the end of file. You can often read out Parsec functions in plain + English just like that. + + + Now we must answer the question: what is a line? We define the + line function to do just that. Reading the + function, we can see that a line consists of cells followed by the end + of line character. + + + So what are cells? We defined them in the cells + function. The cells of a line start with the content of the first + cell, then continue with the content of the remaining cells, if any. + The result is simply the first cell and the remaining cells assembled + into a list. + + Let's skip over remainingCells for a minute and + look at cellContent. A cell contains any number of + characters, but each character must not be a comma or end of line + character. The noneOf function matches one item, so + long as it isn't in the list of items that we pass. So, saying + many (noneOf ",\n") defines a cell the way we want + it. + + + Back in remainingCells, we have the first example of + a choice in Parsec. The choice operator is &<|>;. This operator + behaves like this: it will first try the parser on the left. If it + consumed no input, it will try the parser on the right. + + + So, in remainingCells, our task is to come up with + all the cells after the first. Recall that + cellContent uses noneOf ",\n". + So it will not consume the comma or end-of-line character from the + input. If we see a comma after parsing a cell, it means that at least + one more cell follows. Otherwise, we're done. So, our first choice in + remainingCells is char ','. This + parser simply matches the passed character in the input. If we found a + comma, we want this function to return the remaining cells on the line. + At this point, the "remaining cells" looks exactly like the start of + the line, so we call cells recursively to parse + them. If we didn't find a comma, we return the empty list, signifying + no remaining cells on the line. + + + Finally, we must define what the end-of-line indicator is. We set it + to char '\n', which will suit our purposes fine for + now. + + + At the very end of the program, we define a function + parseCSV that takes a &String; and parses it as a + CSV file. This function is just a shortcut that calls Parsec's + parse function, filling in a few parameters. + parse returns Either ParseError + [[String]] for the CSV file. If there was an error, the + return vaule will be &Left; with the error; otherwise, it will be + &Right; with the result. + + addfile ./examples/ch17/csv1.ghci hunk ./examples/ch17/csv1.ghci 1 +--# s1 +:l csv1.hs +parseCSV "" +--# s2 +parseCSV "hi" +--# s3 +parseCSV "hi\n" +parseCSV "line1\nline2\nline3\n" +parseCSV "cell1,cell2,cell3\n" +parseCSV "l1c1,l1c2\nl2c1,l2c2\n" +parseCSV "Hi,\n\n,Hello\n" addfile ./examples/ch17/csv1.hs hunk ./examples/ch17/csv1.hs 1 +{-- snippet all --} +import Text.ParserCombinators.Parsec + +{- A CSV file contains 0 or more lines, each of which is terminated + by the end-of-line character (eol). -} +csvFile :: GenParser Char st [[String]] +csvFile = + do result <- many line + eof + return result + +-- Each line contains 1 or more cells, separated by a comma +line :: GenParser Char st [String] +line = + do result <- cells + eol -- end of line + return result + +-- Build up a list of cells. Try to parse the first cell, then figure out +-- what ends the cell. +cells :: GenParser Char st [String] +cells = + do first <- cellContent + next <- remainingCells + return (first : next) + +-- The cell either ends with a comma, indicating that 1 or more cells follow, +-- or it doesn't, indicating that we're at the end of the cells for this line +remainingCells :: GenParser Char st [String] +remainingCells = + (char ',' >> cells) -- Found comma? More cells coming + <|> (return []) -- No comma? Return [], no more cells + +-- Each cell contains 0 or more characters, which must not be a comma or +-- EOL +cellContent :: GenParser Char st String +cellContent = + many (noneOf ",\n") + + +-- The end of line character is \n +eol :: GenParser Char st Char +eol = char '\n' + +parseCSV :: String -> Either ParseError [[String]] +parseCSV input = parse csvFile "(unknown)" input +{-- /snippet all --} addfile ./examples/ch17/csv2.hs hunk ./examples/ch17/csv2.hs 1 +{-- snippet all --} +import Text.ParserCombinators.Parsec + +{- A CSV file contains 0 or more lines, each of which is terminated + by the end-of-line character (eol). -} +csvFile = endBy line eol + +-- Each line contains 1 or more cells, separated by a comma +line = sepBy cell (char ',') + +-- Each cell contains 0 or more characters, which must not be a comma or +-- EOL +cell = many (noneOf ",\n") + +-- The end of line character is \n +eol = char '\n' + +parseCSV :: String -> Either ParseError [[String]] +parseCSV input = parse csvFile "(unknown)" input +{-- /snippet all --} }