Parsec Presentation

AST

tl;dr: Short introduction to Parsec for beginner.

  • The html presentation is here.
() () () () () ([1]{(#1)}) ([1]{(#1)}) () () () () () () () () ()

Parsec

by Yann Esposito

Parsing

Latin pars (ōrātiōnis), meaning part (of speech).

  • analysing a string of symbols
  • formal grammar.

Parsing in Programming Languages

Complexity:

Method Typical Example Output Data Structure
Splitting CSV Array, Map
Regexp email
  • Fixed Layout Tree
Parser Programming language
  • Most Data Structure

Parser & culture

In Haskell Parser are really easy to use.

Generally:

  • In most languages: split then regexp then parse
  • In Haskell: split then parse

Parsing Example

From String:

(1+3)*(1+5+9)

To data structure:

AST

Parsec

Parsec lets you construct parsers by combining high-order Combinators to create larger expressions.

Combinator parsers are written and used within the same programming language as the rest of the program.

The parsers are first-class citizens of the languages […]“

Haskell Wiki

Parser Libraries

In reality there are many choices:

attoparsec fast
Bytestring-lexing fast
Parsec 3 powerful, nice error reporting

Haskell Remarks (1)

spaces are meaningful

f x   -- ⇔ f(x) in C-like languages
f x y -- ⇔ f(x,y)

Haskell Remarks (2)

Don’t mind strange operators (<*>, <$>).
Consider them like separators, typically commas.
They are just here to deal with types.

Informally:

toto <$> x <*> y <*> z
-- ⇔ toto x y z
-- ⇔ toto(x,y,z) in C-like languages

Minimal Parsec Examples

whitespaces = many (oneOf "\t ")
number = many1 digit
symbol = oneOf "!#$%&|*+-/:<=>?@^_~"
" “ – whitespaces on” “ “” – whitespaces on “32” “32” – number on “32”


– number on " 32 “ “number” (line 1, column 1): unexpected ” “ expecting digit

Comparison with Regexp (Parsec)

data IP = IP Word8 Word8 Word8 Word8
ip = IP <$>                             -- number.number.number.number
       number <*  char '.' <*>          -- put that in an IP data type
       number <*  char '.' <*>
       number <*  char '.' <*>
       number
number = do
    x <- read <$> many1 digit -- read the number
    guard (0 <= x && x < 256) -- ensure it is 0 <= x < 256
    return (fromIntegral x)   -- returns a Word8

Comparison with Regexp (Perl Regexp)

# remark: 888.999.999.999 is accepted
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

# exact but difficult to read
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
  (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

Also, regexp are unityped by nature.

Monadic style

number :: Parser String
number = many1 digit

number' :: Parser Int
number' = do
    -- digitString :: String
    digitString <- many1 digit
    return (read digitString)
"32" :: [Char]  -- number on "32"
32   :: Int     -- number' on "32"

Combining Monadic style (S = aSb | ε)

s = do
        a <- string "a"
        mid <- s
        b <- string "b"
        return (a ++ mid ++ b)
    <|> string ""
""          -- s on ""
"aaabbb"    -- s on "aaabbb"
"aabb"      -- s on "aabbb"
-- s on "aaabb"
S (line1 1, column 4):
unexpected end of input
expecting "b"

Combining Applicative style (S = aSb | ε)

s = concat3 <$> string "a" <*> s <*> char "b"
    <|> string ""
    where
        concat3 x y z = x ++ y ++ z

Applicative Style usefull with Data types

data IP = IP Word8 Word8 Word8 Word8

parseIP = IP <$>
            number <*  char '.' <*>
            number <*  char '.' <*>
            number <*  char '.' <*>
            number

monadicParseIP = do
    d1 <- number
    char '.'
    d2 <- number
    char '.'
    d3 <- number
    char '.'
    d4 <- number
    return (IP d1 d2 d3 d4)

Write number correctly

number :: Parser Word8
number = do
    x <- read <$> many1 digit
    guard (0 <= x && x < 256) <?>
        "Number between 0 and 255 (here " ++ show x ++ ")"
    return (fromIntegral x)
>>> test parseIP "parseIP" "823.32.80.113"
"parseIP" (line 1, column 4):
unexpected "."
expecting digit or Number between 0 and 255 (here 823)

So

  • combination of simple parsers
  • error messages with (<?>)
  • embed result in data type using Applicative style
  • Not shown, use another monad with the parser

Time to do something cool

Useful definition

try tries to parse and backtracks if it fails.

(<||>) parser1 parser2 = try parser1 <|> parser2

Scheme

Write Yourself a Scheme in 48 hours

Remember from text to data structure. Our data structure:

data LispVal =    Atom String     -- print or val-3 ...
                | Number Integer  -- 32
                | String String   -- "foo"
                | Bool Bool       -- #t or #f
                | List [LispVal]  -- (print "foo" "bar")

Next will parse String, Atom, Integer

Parse String

parseString :: Parser LispVal
parseString = do
    char '"'
    x <- many (noneOf "\"")
    char '"'
    return (String x)
-- parseString on '"toto"'
String "toto" :: LispVal
-- parseString on '" hello"'
String " hello" :: LispVal

Parse Atom

In Scheme true is #t and false #f. Which are also valid Atoms.

symbol :: Parser Char
symbol = oneOf "!#$%&|*+-/:<=>?@^_~"

parseAtom :: Parser LispVal
parseAtom = do
    first <- letter <|> symbol
    rest <- many (letter <|> digit <|> symbol)
    let atom = first:rest
    return $ case atom of
                "#t" -> Bool True
                "#f" -> Bool False
                _    -> Atom atom

Test parseAtom

-- parseAtom on '#t'
Bool True :: LispVal
-- parseAtom on '#f'
Bool False :: LispVal
-- parseAtom on 'some-atom'
Atom "some-atom" :: LispVal

Parse Number

parseNumber :: Parser LispVal
parseNumber = Number . read <$> many1 digit
-- parseNumber on '18'
Number 18 :: LispVal
-- parseNumber on '188930992344321234'
Number 188930992344321234 :: LispVal

Compose all parsers

parseExpr :: Parser LispVal
parseExpr = parseAtom
            <||> parseString
            <||> parseNumber

Test the parser

-- parseExpr on '188930992344321234'
Number 188930992344321234 :: LispVal
-- parseExpr on '#t'
Bool True :: LispVal
-- parseExpr on 'just-some-word'
Atom "just-some-word" :: LispVal
-- parseExpr on '%-symbol-start'
Atom "%-symbol-start" :: LispVal
-- parseExpr on '"a String"'
String "a String" :: LispVal

Recursive Parsers

parseList :: Parser LispVal
parseList = List <$>
    (char '(' *> sepBy parseExpr' spaces <* char ')' )

parseExpr' :: Parser LispVal
parseExpr' = parseAtom
             <||> parseString
             <||> parseNumber
             <||> parseList

Test Parse List

-- parseExpr' on '(foo (bar baz))'
List [Atom "foo",List [Atom "bar",Atom "baz"]] :: LispVal

-- parseExpr' on '(foo (bar)'
"parseExpr'" (line 1, column 11):
unexpected end of input
expecting white space, letter, "\"", digit, "(" or ")"

-- parseExpr' on '(((foo)) bar)'
List [List [List [Atom "foo"]],Atom "bar"] :: LispVal

Remark

Why Haskell?

Close relation between parser and data types.

data Product a b = Product a b
data Sum a b = A a | B b
productParser = do
                  parser1
                  parser2
sumParser = parser1 <||> parser2

Remark (2)

Similar structure:

data MyType = X Int Char | Y String

parseMyType =      (X <$> parseInt <*> parseChar) 
              <||> (Y <$> parseString)

Remark (3)

Similar structure:

data Answer = YES | NO | MAYBE
data MyType = X Int Answer | Y String

parseAnswer = parseYES <||> parseNO <||> parseMAYBE
parseMyType =      (X <$> parseInt <*> parseAnswer) 
              <||> (Y <$> parseString)

Conclusion

So Parser are more powerful than regular expression.
Parsec make it very easy to use.
Easy to read and to manipulate.

Notice how you could use parser as any other object in Haskell.
You could mapM them for example.

Any question?

Appendice (Do it yourself)

{-# LANGUAGE DeriveDataTypeable #-}
import Control.Applicative hiding (many, (<|>))
import Text.Parsec
import Data.Typeable
type Parser a = Parsec String () a

myparser = many1 letter

main :: IO ()
main = test myparser "myparser" "(my (string to) parse)"

test :: (Typeable a, Show a) => Parser a -> String -> String -> IO ()
test parser description string = do
    putStrLn $ "-- " ++ description ++ " on \"" ++ string ++ "\""
    let res = parse parser description string
    case res of
        Left  err   -> print err
        Right value -> putStrLn $ show value ++ " :: " ++ show (typeOf value)

Appendice (2)

Links to example code:

Comments

comments powered by Disqus
Published on 2013-10-09
Follow @yogsototh
Yann Esposito©
Done with Vim & nanoc Hakyll