Text Manipulation with/without Parsec
![Page 1: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/1.jpg)
Text manipulation with/without parsec
October 11, 2011 Vancouver Haskell UnMeetup
Tatsuhiro Ujihisa
Tuesday, October 11, 2011
![Page 2: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/2.jpg)
• Tatsuhiro Ujihisa
• @ujm
• HootSuite Media inc
• Osaka, Japan
• Vim: 14
• Haskell: 5
![Page 3: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/3.jpg)
Topics
• text manipulation functions with/without parsec
• parsec library
• texts in Haskell
• attoparsec library
![Page 4: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/4.jpg)
Haskell for work
• Something academic
• Something mathematical
• Web app
• Better shell scripting
• (Improve yourself)
![Page 5: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/5.jpg)
Text manipulation
• The concept of text
• String is [Char]
• lazy
• Pattern matching
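A minimal sketch of these three points, using only the Prelude. The names `countChar` and `firstFive` are hypothetical and chosen for illustration; they are not from the talk.

```haskell
-- String is a type synonym for [Char], so list patterns work on it.
countChar :: Char -> String -> Int
countChar _ "" = 0
countChar c (x:xs)
  | c == x    = 1 + countChar c xs
  | otherwise = countChar c xs

-- Laziness: only the first five characters of the infinite string are built.
firstFive :: String
firstFive = take 5 (cycle "ab")
```

In GHCi, `countChar 'a' "banana"` evaluates to `3` and `firstFive` to `"ababa"`.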
![Page 6: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/6.jpg)
Example: split
• Ruby/Python example
  • 'aaa<>bb<>c<><>d'.split('<>')
    => ['aaa', 'bb', 'c', '', 'd']
• Vim script example
  • split('aaa<>bb<>c<><>d', '<>')
![Page 7: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/7.jpg)
split in Haskell
• split :: String -> String -> [String]
• split "aaa<>bb<>c<><>d" "<>"
  -- ["aaa", "bb", "c", "", "d"]
• "aaa<>bb<>c<><>d" `split` "<>"
![Page 8: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/8.jpg)
Design of split
• split "aaa<>bb<>c<><>d" "<>"
• "aaa" : split "bb<>c<><>d" "<>"
• "aaa" : "bb" : split "c<><>d" "<>"
• "aaa" : "bb" : "c" : split "<>d" "<>"
• "aaa" : "bb" : "c" : "" : split "d" "<>"
• "aaa" : "bb" : "c" : "" : "d" : split "" "<>"
• "aaa" : "bb" : "c" : "" : "d" : []
![Page 9: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/9.jpg)
Design of split
• split "aaa<>bb<>c<><>d" "<>"
• "aaa" : split "bb<>c<><>d" "<>"
![Page 10: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/10.jpg)
Design of split
• split "aaa<>bb<>c<><>d" "<>"
• split' "aaa<>bb<>c<><>d" "<>" ""
• split' "aa<>bb<>c<><>d" "<>" "a"
• split' "a<>bb<>c<><>d" "<>" "aa"
• split' "<>bb<>c<><>d" "<>" "aaa"
• "aaa" : split "bb<>c<><>d" "<>"
![Page 11: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/11.jpg)
    -- Split str on every occurrence of the separator pat.
    split :: String -> String -> [String]
    str `split` pat = split' str pat ""

    -- memo accumulates the current field in reverse order.
    split' :: String -> String -> String -> [String]
    split' "" _ memo = [reverse memo]
    split' str pat memo = let (a, b) = splitAt (length pat) str in
                          if a == pat
                             then (reverse memo) : (b `split` pat)          -- separator found: emit the field
                             else split' (tail str) pat (head str : memo)   -- otherwise keep one more char
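The same recursion can also be sketched with `stripPrefix` from `Data.List`, which performs the "does the separator start here?" check directly. This is a hypothetical alternative, not the version from the talk. The name `splitBy`, the separator-first argument order, and the choice to never split on an empty separator are all assumptions.

```haskell
import Data.List (stripPrefix)

-- Hypothetical alternative to split/split': separator first, input second.
splitBy :: String -> String -> [String]
splitBy pat = go ""
  where
    -- memo holds the current field in reverse order.
    go memo "" = [reverse memo]
    go memo str@(c:cs) =
      case stripPrefix pat str of
        -- Separator found here: emit the field and restart after it.
        Just rest | not (null pat) -> reverse memo : go "" rest
        -- No separator (or an empty one): keep the character and move on.
        _                          -> go (c : memo) cs
```

In GHCi, `splitBy "<>" "aaa<>bb<>c<><>d"` evaluates to `["aaa","bb","c","","d"]`.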
![Page 12: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/12.jpg)
Another approach
• Text.Parsec: v3
• Text.ParserCombinators.Parsec: v2
• Real World Haskell Parsec chapter
• csv parser
![Page 13: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/13.jpg)
Design of split
• split "aaa<>bb<>c<><>d" "<>"
• many of
  • any char except the string "<>"
  • separated by "<>" or the end of the string
![Page 14: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/14.jpg)
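    import qualified Text.Parsec as P

    split :: String -> String -> [String]
    str `split` pat = case P.parse (split' (P.string pat)) "split" str of
                           Right x -> x
                           Left err -> error (show err)  -- not expected for a non-empty pat

    -- Fields are runs of any char, each ended by eof or by the separator ahead.
    split' :: P.Parsec String () String -> P.Parsec String () [String]
    split' pat = P.anyChar `P.manyTill` (P.eof P.<|> (P.try (P.lookAhead pat) >> return ())) `P.sepBy` pat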
![Page 15: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/15.jpg)
In split':
• P.anyChar: any char
• P.eof P.<|> (P.try (P.lookAhead pat) >> return ()): up to the end of the string or the separator pattern (checked without consuming text)
![Page 16: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/16.jpg)
    import qualified Text.Parsec as P

    main :: IO ()
    main = do
      print $ abc1 "abc"  -- True
      print $ abc1 "abcd" -- False
      print $ abc2 "abc"  -- True
      print $ abc2 "abcd" -- False

    abc1, abc2 :: String -> Bool
    abc1 str = str == "abc"
    -- "abc" followed by end of input
    abc2 str = case P.parse (P.string "abc" >> P.eof) "abc" str of
                 Right _ -> True
                 Left _ -> False
![Page 17: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/17.jpg)
    import qualified Text.Parsec as P

    main :: IO ()
    main = do
      print $ parenthMatch1 "(a (b c))" -- True
      print $ parenthMatch1 "(a (b c)"  -- False
      print $ parenthMatch1 ")(a (b c)" -- False
      print $ parenthMatch2 "(a (b c))" -- True
      print $ parenthMatch2 "(a (b c)"  -- False
      print $ parenthMatch2 ")(a (b c)" -- False

    -- Without Parsec: count the currently open parentheses.
    parenthMatch1 :: String -> Bool
    parenthMatch1 str = f str (0 :: Int)
      where
        f "" 0 = True
        f "" _ = False
        f ('(':xs) n = f xs (n + 1)
        f (')':_) 0 = False
        f (')':xs) n = f xs (n - 1)
        f (_:xs) n = f xs n

    -- With Parsec: f is any mix of plain chars and balanced groups g.
    parenthMatch2 :: String -> Bool
    parenthMatch2 str =
      case P.parse (f >> P.eof) "parenthMatch" str of
        Right _ -> True
        Left _ -> False
      where
        f = P.many (P.noneOf "()" P.<|> g)
        g = do
          P.char '('
          f
          P.char ')'
![Page 18: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/18.jpg)
Parsec API
• anyChar
• char 'a'
• string "abc" == string ['a', 'b', 'c'], which accepts the same input as char 'a' >> char 'b' >> char 'c' (but returns the whole string)
• oneOf ['a', 'b', 'c']
• noneOf "abc"
• eof
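A small sketch of these primitives in use, assuming the `parsec` package with `String` input. The names `vowel`, `quoted`, and `parseAll` are hypothetical helpers introduced for illustration.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- oneOf: any one of the listed characters.
vowel :: Parser Char
vowel = P.oneOf "aeiou"

-- char and noneOf: a double-quoted string without escape sequences.
quoted :: Parser String
quoted = do
  _ <- P.char '"'
  body <- P.many (P.noneOf "\"")
  _ <- P.char '"'
  return body

-- eof: run a parser and require that it consumes the whole input.
parseAll :: Parser a -> String -> Maybe a
parseAll p str = case P.parse (p <* P.eof) "example" str of
  Right x -> Just x
  Left _  -> Nothing
```

In GHCi, `parseAll quoted "\"abc\""` evaluates to `Just "abc"`, while `parseAll (P.string "abc") "abcd"` evaluates to `Nothing` because `eof` rejects the trailing `d`.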
![Page 19: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/19.jpg)
Parsec API (combinator)
• >>, >>=, return, and fail
• <|>
• many p
• p1 `manyTill` p2
• p1 `sepBy` p2
• chainl p op x (x is the result when p matches nothing); p `chainl1` op is the two-argument form
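A short sketch of `sepBy`, `chainl1`, and `manyTill`, assuming the `parsec` package with `String` input. The parser names (`number`, `numberList`, `difference`, `comment`) are hypothetical examples, not part of the talk.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- One or more digits, read as an Int (many1 is the non-empty form of many).
number :: Parser Int
number = read <$> P.many1 P.digit

-- sepBy: zero or more numbers separated by commas.
numberList :: Parser [Int]
numberList = number `P.sepBy` P.char ','

-- chainl1: numbers joined by a left-associative binary operator.
difference :: Parser Int
difference = number `P.chainl1` (P.char '-' >> return (-))

-- manyTill: every char up to the terminator, which is consumed as well.
comment :: Parser String
comment = P.string "/*" >> P.anyChar `P.manyTill` P.try (P.string "*/")
```

Parsing `"10-3-2"` with `difference` gives `5`, that is `(10 - 3) - 2`, which shows the left associativity of `chainl1`.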
![Page 20: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/20.jpg)
Parsec API (etc)
• try p
• lookAhead p
• notFollowedBy p
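A sketch of what each of these three does, assuming the `parsec` package with `String` input. The names `abOrAc`, `peekThenRead`, and `splitNF` are hypothetical. `splitNF` is an alternative way to express a split parser with `notFollowedBy`, not the version shown earlier in the talk.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- try: on failure, rewind so the alternative starts from the same position.
-- Without try, string "ab" can consume the 'a' before failing on "ac".
abOrAc :: Parser String
abOrAc = P.try (P.string "ab") P.<|> P.string "ac"

-- lookAhead: run a parser, but leave the input untouched when it succeeds.
peekThenRead :: Parser (Char, String)
peekThenRead = do
  c <- P.lookAhead P.anyChar
  rest <- P.many P.anyChar
  return (c, rest)

-- notFollowedBy: succeed only when the given parser would fail here.
-- Each field is a run of chars, none of which starts a separator.
splitNF :: String -> Parser [String]
splitNF pat = field `P.sepBy` sep
  where
    sep   = P.try (P.string pat)
    field = P.many (P.notFollowedBy sep >> P.anyChar)
```

Running `P.parse (splitNF "<>") "" "aaa<>bb<>c<><>d"` yields `Right ["aaa","bb","c","","d"]`.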
![Page 21: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/21.jpg)
texts in Haskell
![Page 22: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/22.jpg)
Three types of text
• String
• ByteString
• Text
![Page 23: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/23.jpg)
String
• [Char]
• Char: a Unicode character (code point)
• "aaa" is String
• List is lazy and slow
![Page 24: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/24.jpg)
ByteString
• import Data.ByteString
• Base64
• Char8
• UTF8
• Lazy (Char8, UTF8)
• Fast. The default in Snap
![Page 25: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/25.jpg)
ByteString (cont'd)
• OverloadedStrings with Char8
• Give the type explicitly or use it with ByteString functions
    {-# LANGUAGE OverloadedStrings #-}
    import Data.ByteString.Char8 ()
    import Data.ByteString (ByteString)

    main :: IO ()
    main = print ("hello" :: ByteString)
![Page 26: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/26.jpg)
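ByteString (cont'd)
    import Data.ByteString.UTF8 ()
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Char8 as BC
    import Codec.Binary.UTF8.String (encode)

    -- encode turns the String into UTF-8 bytes; B.pack builds the ByteString.
    -- Char8's putStrLn is used because Data.ByteString.putStrLn was later deprecated.
    main :: IO ()
    main = BC.putStrLn (B.pack $ encode "こんにちは" :: B.ByteString)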
![Page 27: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/27.jpg)
Text
• import Data.Text
• import Data.Text.IO
• always Unicode (one fixed internal encoding; convert with Data.Text.Encoding)
• import Data.Text.Lazy
• Fast
![Page 28: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/28.jpg)
Text (cont'd)
• UTF-8 friendly
    {-# LANGUAGE OverloadedStrings #-}
    import Data.Text (Text)
    import qualified Data.Text.IO as T

    main :: IO ()
    main = T.putStrLn ("こんにちは" :: Text)
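Converting between the three types can be sketched as follows, assuming the `text` and `bytestring` packages. The binding names (`greeting`, `greetingText`, `greetingBytes`, `roundTrip`) are hypothetical and only serve the illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

greeting :: String
greeting = "こんにちは"

-- String -> Text with pack (and back with unpack).
greetingText :: T.Text
greetingText = T.pack greeting

-- Text -> ByteString needs an explicit encoding; here UTF-8.
greetingBytes :: B.ByteString
greetingBytes = TE.encodeUtf8 greetingText

-- ByteString -> Text with decodeUtf8 (throws on invalid UTF-8 input).
roundTrip :: String
roundTrip = T.unpack (TE.decodeUtf8 greetingBytes)
```

`T.length greetingText` is `5` characters, while `B.length greetingBytes` is `15` bytes, since each of these characters takes three bytes in UTF-8.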
![Page 29: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/29.jpg)
Parsec supports
• String
• ByteString
![Page 30: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/30.jpg)
Attoparsec supports
• ByteString
• Text
![Page 31: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/31.jpg)
Attoparsec
• cabal install attoparsec
• attoparsec-text
• attoparsec-enumerator
• attoparsec-iteratee
• attoparsec-text-enumerator
![Page 32: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/32.jpg)
Attoparsec pros/cons
• Pros
  • fast
  • text support
  • enumerator/iteratee
• Cons
  • no lookAhead/notFollowedBy
![Page 33: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/33.jpg)
Parsec and Attoparsec
Parsec:
    import qualified Text.Parsec as P

    main :: IO ()
    main = print $ abc "abc"

    abc :: String -> Bool
    abc str = case P.parse f "abc" str of
                Right _ -> True
                Left _ -> False

    f :: P.Parsec String () String
    f = P.string "abc"
Attoparsec:
    {-# LANGUAGE OverloadedStrings #-}
    import Data.Text (Text)
    import qualified Data.Attoparsec.Text as P

    main :: IO ()
    main = print $ abc "abc"

    abc :: Text -> Bool
    abc str = case P.parseOnly f str of
                Right _ -> True
                Left _ -> False

    f :: P.Parser Text
    f = P.string "abc"
![Page 34: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/34.jpg)
return ()
![Page 35: Text Manipulation with/without Parsec](https://reader034.fdocuments.in/reader034/viewer/2022052321/55493fcfb4c9050a4d8b4fea/html5/thumbnails/35.jpg)
Practice
• args "f(x, g())"
  -- ["x", "g()"]
• args "f(, aa(), bb(c))"
  -- ["", "aa()", "bb(c)"]
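One possible solution sketch with Parsec, offered as an illustration rather than the intended answer. The helper parsers `arg`, `nested`, and `call` are hypothetical names. Returning `[]` on a parse failure is an assumption, since the exercise does not say what should happen for malformed input. Note that under this sketch `args "f()"` gives `[""]`, a single empty argument.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- One argument: plain chars and balanced parenthesised groups,
-- stopping at a top-level ',' or ')'. It may be empty.
arg :: Parser String
arg = concat <$> P.many (plain P.<|> nested)
  where
    plain = P.many1 (P.noneOf "(),")

-- A balanced group, returned with its parentheses; commas inside are kept.
nested :: Parser String
nested = do
  _ <- P.char '('
  inner <- concat <$> P.many (P.many1 (P.noneOf "()") P.<|> nested)
  _ <- P.char ')'
  return ("(" ++ inner ++ ")")

-- name(arg, arg, ...) followed by the end of input.
call :: Parser [String]
call = do
  _ <- P.many1 (P.noneOf "(),")
  _ <- P.char '('
  xs <- arg `P.sepBy` (P.char ',' >> P.spaces)
  _ <- P.char ')'
  P.eof
  return xs

-- Assumption: a parse failure yields [] (the exercise leaves this open).
args :: String -> [String]
args = either (const []) id . P.parse call "args"
```

With these definitions, `args "f(x, g())"` evaluates to `["x","g()"]` and `args "f(, aa(), bb(c))"` to `["","aa()","bb(c)"]`.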