Table of contents
- Vector: array slices
- Bytes: Word8 vector
- Text: UTF-8 encoded Bytes
- Print to Text
- List fusion
- Type cheatsheet
Vector: array slices
Z.Haskell uses immutable arrays heavily. Z.Data.Vector provides two main array slice types:
-- The payloads are array offset and length
data Vector a = Vector (SmallArray a) Int Int
data PrimVector a = PrimVector (PrimArray a) Int Int
...
These types support efficient slicing operations (take, drop, break, etc.). To abstract over them, the Vec class is introduced:
class (Arr (IArray v) a) => Vec v a where
    -- | Vector's immutable array type
    type IArray v :: Type -> Type
    -- | Get the underlying array and slice range (offset and length).
    toArr :: v a -> (IArray v a, Int, Int)
    -- | Create a vector by slicing an array (with offset and length).
    fromArr :: IArray v a -> Int -> Int -> v a
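To see why an array plus an offset and a length makes take and drop cheap, here is a minimal sketch built on Data.Array from the array package. It is a toy model, not Z-Data's implementation: Slice, sliceFromList, takeS, dropS and sliceToList are hypothetical names introduced only for illustration.

```haskell
import Data.Array (Array, listArray, (!))

-- | A toy slice (hypothetical, for illustration only):
-- backing array, offset, length. Same layout idea as Vector.
data Slice a = Slice (Array Int a) Int Int

-- | Build a slice covering a whole freshly allocated array.
sliceFromList :: [a] -> Slice a
sliceFromList xs = Slice (listArray (0, n - 1) xs) 0 n
  where
    n = length xs

-- | O(1): only the length changes; the backing array is shared.
takeS :: Int -> Slice a -> Slice a
takeS n (Slice arr off len) = Slice arr off (max 0 (min n len))

-- | O(1): advance the offset and shrink the length.
dropS :: Int -> Slice a -> Slice a
dropS n (Slice arr off len) = Slice arr (off + k) (len - k)
  where
    k = max 0 (min n len)

-- | Read the elements that the slice currently covers.
sliceToList :: Slice a -> [a]
sliceToList (Slice arr off len) = [arr ! i | i <- [off .. off + len - 1]]
```

With these definitions, sliceToList (takeS 3 (dropS 2 (sliceFromList [0 .. 9]))) evaluates to [2,3,4] without copying any element of the backing array.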
Vector and PrimVector are obvious instances, but plain array types are also instances of Vec, with an O(n) fromArr. For example:
instance Prim a => Vec PrimArray a where
    type IArray PrimArray = PrimArray
    toArr arr = (arr, 0, sizeofArr arr)
    fromArr = fromArray

-- | Construct a slice from an array by copying (if necessary).
fromArray :: Arr arr a => arr a -> Int -> Int -> arr a
fromArray arr offset len | offset == 0 && sizeofArr arr == len = arr
                         | otherwise = cloneArr arr offset len
These instances give Vec great flexibility: combinators implemented against Vec work on the various slice types and on plain array types alike. Take the map' combinator from Z.Data.Vector:
map' :: forall u v a b. (Vec u a, Vec v b) => (a -> b) -> u a -> v b
Note that the input and output Vec types are not required to be the same, which makes applications like the following possible:
data User = User { ..., age :: Int, ...}
-- | Take all user's age and pack them into a `PrimArray`.
takeAllAges :: Vector User -> PrimArray Int
takeAllAges = map' age
The above function works efficiently, as expected: each User’s age is written directly into a new PrimArray with no extra copies.
All functions in Z.Data.Vector are implemented against the Vec constraint, which sometimes leads to type inference failures. It’s recommended to enable the TypeApplications extension and add the necessary type annotations:
{-# LANGUAGE TypeApplications #-}
import qualified Z.Data.Vector as V
...
-- if you don't write annotations, GHC may get confused
-- which type of vectors you want to pack.
let v = V.pack @PrimVector @Word [1..1024]
...
Bytes: Word8 vector
One of the most commonly used vector types is type Bytes = PrimVector Word8, which represents binary data. To make writing Bytes literals more convenient, Bytes is an instance of IsString:
> import qualified Z.Data.Vector as V
> :set -XOverloadedStrings
> "hello, world" :: V.Bytes
"hello, world"
> "你好世界" :: V.Bytes -- Unicode literals get chopped!
[96,125,22,76]
In the above example, the Unicode literal “你好世界” does not produce a UTF-8 encoded byte vector as one might expect. Use Text to get that behaviour:
> import qualified Z.Data.Text as T
> T.getUTF8Bytes "你好世界"
[228,189,160,229,165,189,228,184,150,231,149,140]
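The chopped result above is consistent with each code point being truncated to its low 8 bits. The following base-only sketch reproduces that arithmetic. truncateToBytes is a hypothetical helper written for illustration and is not part of Z-Data, nor necessarily how its IsString instance is implemented.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- | Keep only the low 8 bits of every code point.
-- Hypothetical helper: fromIntegral from Int to Word8 wraps modulo 256.
truncateToBytes :: String -> [Word8]
truncateToBytes = map (fromIntegral . ord)
```

For example, 你 is U+4F60, whose low byte is 0x60 = 96, so truncateToBytes "你好世界" evaluates to [96,125,22,76], the same numbers as in the session above.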
Note that the Show instance of Bytes is not specialized to show ASCII characters. You can use functions from Z.Data.Vector.Hex and Z.Data.Vector.Base64 to manually encode binary Bytes into ASCII strings:
> import Z.Data.Vector.Hex
> hexEncode True "hello world"
"68656C6C6F20776F726C64"
> import Z.Data.Vector.Base64
> base64Encode "hello wolrd"
"aGVsbG8gd29scmQ="
Z-Data uses incoherent instances to give Bytes its own JSON instance (using base64 encoding):
> V.pack [0..127] :: V.Bytes
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]
> import qualified Z.Data.JSON as JSON
> JSON.encode (V.pack [0..127] :: V.Bytes)
"\"AAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIjJCUmJygpKissLS4vMDEyMzQ1Njc4OTo7PD0+P0BBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWltcXV5fYGFiY2RlZmdoaWprbG1ub3BxcnN0dXZ3eHl6e3x9fn8=\""
> JSON.encode (V.pack [0..127] :: V.PrimVector Int)
"[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]"
Besides special instances, many functions in Z.Data.Vector, such as break and takeWhile, leverage rewrite rules to use more efficient instructions when used with Bytes. These optimizations should make no visible difference to users.
Text: UTF-8 encoded Bytes
The Text type from Z.Data.Text is a newtype wrapper around Bytes which provides a UTF-8 encoding guarantee. You should construct a Text only with validate, validateMaybe, or string literals:
> import qualified Z.Data.Text as T
> T.validate "hello world"
"hello world"
> T.validate "hello world, \128"
*** Exception: InvalidUTF8Exception [("validate",SrcLoc {srcLocPackage = "interactive", srcLocModule = "Ghci12", srcLocFile = "<interactive>", srcLocStartLine = 52, srcLocStartCol = 1, srcLocEndLine = 52, srcLocEndCol = 31})]
> "你好世界" :: T.Text
"你好世界"
In Haskell, Strings are allowed to contain code points that are illegal in UTF-8, so that any UNIX file path can be encoded as a String. Z.Haskell has a special type for file paths instead, so Text converts illegal code points found in string literals:
> "hello world, \55296" :: T.Text
"hello world, �"
> T.getUTF8Bytes "hello world, \55296" -- surrogates
[104,101,108,108,111,32,119,111,114,108,100,44,32,239,191,189]
The byte sequence 239, 191, 189 is the UTF-8 encoding of the replacement character U+FFFD. Because a Text can only be created in these limited ways, combinators in Z.Data.Text can safely assume that a Text contains only UTF-8 encoded code points.
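The replacement behaviour can be sketched with base alone. The encoder below is a hand-written illustration of standard UTF-8 encoding with surrogates mapped to U+FFFD. utf8Encode and encodeCodePoint are hypothetical names, and this is not Z-Data's actual implementation.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode a String as UTF-8, replacing surrogate code points
-- (U+D800..U+DFFF) with the replacement character U+FFFD.
utf8Encode :: String -> [Word8]
utf8Encode = concatMap (encodeCodePoint . sanitize . ord)
  where
    sanitize n
      | n >= 0xD800 && n <= 0xDFFF = 0xFFFD
      | otherwise = n

-- | Standard 1- to 4-byte UTF-8 encoding of a single code point.
encodeCodePoint :: Int -> [Word8]
encodeCodePoint n
  | n < 0x80 = [fromIntegral n]
  | n < 0x800 = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    hi, cont :: Int -> Word8
    -- Leading byte payload: the bits above the given shift.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)
```

Under this sketch, utf8Encode "hello world, \55296" ends in 239,191,189, matching the output shown above.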
Z.Data.Text also provides some Unicode processing capabilities, such as normalization and case mapping:
> T.validate "re\204\129sume\204\129"
"résumé"
> T.normalize (T.validate "re\204\129sume\204\129")
"résumé"
> T.getUTF8Bytes $ (T.validate "re\204\129sume\204\129")
[114,101,204,129,115,117,109,101,204,129]
> T.getUTF8Bytes $ T.normalize (T.validate "re\204\129sume\204\129")
[114,195,169,115,117,109,195,169]
> T.toUpper "διακριτικός"
"ΔΙΑΚΡΙΤΙΚΌΣ"
Regular expressions based on the re2 regex engine are also provided:
> import qualified Z.Data.Text.Regex as RE
> let emailRegex = RE.regex "([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,6})"
> RE.match emailRegex "hello@world.com"
("hello@world.com",[Just "hello",Just "world",Just "com"],"")
> RE.match emailRegex "foobar"
("",[],"foobar")
> RE.replace emailRegex True "hello@world.com, foo@bar.com" "x@y.z"
"x@y.z, x@y.z"
> RE.extract emailRegex "hello@world.com" "http://\\2.\\3"
"http://world.com"
Print to Text
The Z.Data.Text module provides toText to quickly convert a data type to Text, based on the Print class. It is similar to Show and can be derived through Generic:
> import GHC.Generics
> import qualified Z.Data.Text as T
> data Date = Date { year :: Int, month :: Int, day :: Int } deriving (Generic, T.Print)
> T.toText $ Date 2020 1 12
"Date {year = 2020, month = 1, day = 12}"
It’s recommended to derive Print for your data types to get fast text conversion, though current GHC compiles Generic instances fairly slowly.
List fusion
Vec instances and Text support build/foldr fusion by providing pack/unpack with fusion rules enabled. The following code should iterate over the input vector and produce the output vector in a single pass, rather than producing an intermediate list:
f :: V.Vector a -> V.Vector b
f = V.pack . filter h . map g . V.unpack
This is different from the following code, which produces an intermediate vector (though it may not be slower):
f :: V.Vector a -> V.Vector b
f = V.filter h . V.map' g
When working with sequential data, it’s recommended to choose vectors as the final representation, since they are more compact and GC-friendly.
Type cheatsheet
Z-Data has already simplified a lot of types, but in case you get confused, here’s a type cheat sheet:
+---------------------------------------------------------+
| Vec class                                               |  + Use Array to save ADTs.
|                                                         |  + Use SmallArray if you don't
|  +----------------------+  +-----------------------+    |    often mutate.
|  | Arr class            |  | Slice types           |    |  + Use PrimArray to save
|  |                      |  | support O(1) slicing  |    |    primitive types like
|  |  +---------+         |  | with offset/length    |    |    Int or Word8.
|  |  | Array a |         |  |                       |    |  + Use UnliftedArray to save
|  |  +---------+         |  |                       |    |    unlifted types like
|  |                      |  |                       |    |    IORef or Array.
|  |  +---------------+   |  |                       |    |
|  |  |UnliftedArray a|   |  |                       |    |  + Use slice types to get O(1)
|  |  +---------------+   |  |                       |    |    slicing operations.
|  |                      |  |                       |    |  + Use Bytes to represent
|  |  +--------------+    |  |  +----------+         |    |    binary data.
|  |  | SmallArray a +->arrVec->+ Vector a |         |    |
|  |  +--------------+    |  |  +----------+         |    |  + Use Text to represent
|  |                      |  |                       |    |    UTF-8 encoded bytes.
|  |  +-------------+     |  |  +--------------+     |    |
|  |  | PrimArray a +->arrVec->-+ PrimVector a |     |    |
|  |  +-------------+     |  |  +--------------+---+ |    |
|  |                      |  |  | Bytes            | |    |
|  |                      |  |  | PrimVector Word8 | |    |
|  |                      |  |  +-------+----------+ |    |
|  +----------------------+  +----------V------------+    |
+---------------------------------------|-----------------+
                                    validate
                                        |
                                        V
                               +--------+------------+
                               | Text                |
                               | UTF-8 encoded Bytes |
                               +---------------------+