Haskell String Types Cheatsheet

People who are new to Haskell often wonder what's going on with «string» types, especially those who still get by with strings and half-baked regex-based parsers because they don't know any better. As I often say, «String is the poor man's data structure». But I digress...

This cheat sheet helps you figure out what the usual string types are, what they are useful for, and how to go back and forth among them.

The standard `String` type

It comes from the original Haskell design, and it has to be included in the Standard Prelude. There's nothing special about it: it's just a list of characters, i.e.

type String = [Char]

This simplicity is very important for pedagogical reasons: it helps in learning how parametric polymorphism is way more powerful than subtype specialization. Any function that works on lists works on String, and that helps get the point across.
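
As a minimal sketch of that point, the snippet below applies ordinary list functions to String values. The name `shout` is made up for this illustration; everything else comes from the Prelude and Data.Char.

```haskell
import Data.Char (toUpper)

-- A String is just [Char], so a plain list `map` is all we need.
-- `shout` is a hypothetical helper, not part of any library.
shout :: String -> String
shout = map toUpper

main :: IO ()
main = do
  putStrLn (shout "hello")                 -- map over the characters
  print (length "hello")                   -- list length
  putStrLn (reverse "hello")               -- list reversal
  putStrLn (filter (/= ' ') "no spaces")   -- list filter
```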

However, this simplicity comes with limitations. For starters, it becomes very hard (but not impossible) to use String to represent anything beyond ASCII and printable Unicode. The worst part is that, being simple linked lists of boxed Char, they are very inefficient for large-scale string processing, both in time and space. Don't use them for that.

The «human-readable» `Text`

A string data type that is optimized for efficiently handling «human readable» Char sequences. It provides an API offering the same functionality as String and regular list manipulation functions.

In order to use it, you'll need the text package, and import the module to get access to the API. It's 100% Unicode compatible and it's almost always a better choice than plain String. You get two choices:

«Strict» Text (Data.Text). Uses tightly packed vectors of Unicode characters. Each value must fit in memory in order to be used.
«Lazy» Text (Data.Text.Lazy). Uses a list of «Strict» Text chunks. It allows larger-than-RAM lazy text processing.

Both variants are subject to fusion, in that if you use the API and higher-order functions (folds, maps, and scans), the compiler combines operations efficiently to minimize or downright eliminate intermediate structures.

This is the data type you want to read/write to text files, regular database columns, and pretty much anything that will be read or written by a human.
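
Here is a small sketch of what working with strict Text looks like, assuming the text package is available. The `normalize` helper and its sample input are invented for this example. The results are shown with `print` so that the output does not depend on the terminal's locale.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as TS

-- Hypothetical helper: trim, collapse inner whitespace, and lowercase.
normalize :: TS.Text -> TS.Text
normalize = TS.toLower . TS.unwords . TS.words . TS.strip

main :: IO ()
main = do
  let raw = "  Ñandú   Café  " :: TS.Text
  print (normalize raw == "ñandú café")   -- True
  print (TS.length (normalize raw))       -- counts characters, not bytes
```

The API mirrors the list functions you already know (`words`, `unwords`, `length`), which is why the qualified import matters.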

The «machine readable» `ByteString`

A string data type that is optimized for efficiently handling «machine readable» Byte sequences. It provides an API offering the same functionality as String and regular list manipulation functions, with efficiency similar to that of Text.

In order to use it, you'll need the bytestring package, and import the module to get access to the API. There's no implicit encoding, and it's the correct choice when serializing binary data or dealing with interoperable protocols. You get two choices:

«Strict» ByteString (Data.ByteString). Uses tightly packed vectors of bytes (not characters). It's the programmer's responsibility to figure out whatever encoding is necessary to interpret them.
«Lazy» ByteString (Data.ByteString.Lazy). Uses a list of «Strict» ByteString chunks. It allows larger-than-RAM lazy byte sequence processing.

This is the data type you want to read/write to binary files, deal with binary protocols and serialized data formats (JSON and XML), and pretty much anything that will be read or written by another program.
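
To make the «bytes, not characters» distinction concrete, the sketch below encodes a Text value as UTF-8 and compares lengths. It assumes the text and bytestring packages; `greeting` and `encoded` are names chosen just for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString    as BS
import qualified Data.Text          as TS
import qualified Data.Text.Encoding as TSE

greeting :: TS.Text
greeting = "señor λ"

-- The same content as raw UTF-8 bytes.
encoded :: BS.ByteString
encoded = TSE.encodeUtf8 greeting

main :: IO ()
main = do
  print (TS.length greeting)   -- 7 characters
  print (BS.length encoded)    -- 9 bytes: ñ and λ take two bytes each
  print (BS.unpack encoded)    -- the raw Word8 values
```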

Usage and going back and forth

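For the rest of this cheat sheet, the following imports are assumed alongside the implicit Prelude with its String provision. Note that these naming conventions are not mandatory, yet kind of traditional. The imports are qualified because these modules export many names that clash with the Prelude.

The utf8-string package is needed for the Data.ByteString.UTF8 and Data.ByteString.Lazy.UTF8 modules.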

import qualified Data.ByteString           as BS
import qualified Data.ByteString.UTF8      as BSU
import qualified Data.ByteString.Lazy      as BL
import qualified Data.ByteString.Lazy.UTF8 as BLU
import qualified Data.Text                 as TS
import qualified Data.Text.Encoding        as TSE
import qualified Data.Text.Lazy            as TL
import qualified Data.Text.Lazy.Encoding   as TLE

You should always use the OverloadedStrings GHC extension. It makes string literals polymorphic, so each literal is converted into whatever string type the context requires. Also, it ensures that Text literals are always correctly encoded.
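
As a rough illustration, the same literal below takes four different types depending only on the type signature. The binding names are made up for this sketch, and the text and bytestring packages are assumed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.Text       as TS
import qualified Data.Text.Lazy  as TL

asString :: String
asString = "plain"

asText :: TS.Text
asText = "plain"

asLazyText :: TL.Text
asLazyText = "plain"

-- Keep ByteString literals ASCII-only: no UTF-8 encoding happens here.
asBytes :: BS.ByteString
asBytes = "plain"

main :: IO ()
main =
  print (length asString, TS.length asText, TL.length asLazyText, BS.length asBytes)
```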

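You should never ever use Unicode inside a literal ByteString. It will not work: the literal is built by keeping only the low 8 bits of each character, so no UTF-8 encoding takes place and anything above U+00FF is mangled. It's not meant to be used that way. No, the «language» is not wrong, YOU'RE wrong. Just to be clear, the following will always be wrong: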

wrongLitLBS :: BL.ByteString
wrongLitLBS = "these á ñ λ will be badly© encoded"

If you insist on having a literal ByteString with embedded Unicode (a thing that doesn't make sense, but alas), then you'll need:

welpLitLBS :: BL.ByteString
welpLitLBS = BLU.fromString "these á ñ λ will be safely© encoded"
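
If you want to see the damage for yourself, here is a small sketch comparing the two behaviours on a single character. It uses `Data.ByteString.Char8.pack`, which truncates each character to 8 bits just like a ByteString literal does, and goes through Text for the proper encoding instead of utf8-string. The alias `BSC` and the binding names are assumptions of this example.

```haskell
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as BSC
import qualified Data.Text             as TS
import qualified Data.Text.Encoding    as TSE

-- Truncation: λ is U+03BB, and only the low byte 0xBB survives.
truncated :: BS.ByteString
truncated = BSC.pack "λ"

-- Explicit UTF-8 encoding: λ becomes the two bytes 0xCE 0xBB.
encoded :: BS.ByteString
encoded = TSE.encodeUtf8 (TS.pack "λ")

main :: IO ()
main = do
  print (BS.unpack truncated)   -- [187]
  print (BS.unpack encoded)     -- [206,187]
```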

The conversions

Some libraries use String, some libraries use Text, some libraries use ByteString, some use all of them. You'll have a lazy something, but need a strict one, or vice versa.

For instance, you might have a domain name as a Text value, but in order to perform a DNS query, you need to build a DNS message that must be a ByteString. Or you might want to parse an XML/JSON payload, which comes in as a ByteString, and then turn it into some Text fields within Haskell data types.

That is, you'll need to convert among types and representations. Hence, this cheat-sheet:

`String` <-> `ByteString`

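The String holds Unicode characters, and the ByteString is assumed to hold their UTF-8 encoding. If that assumption is wrong, these won't give you back the text you expect: utf8-string replaces invalid byte sequences with the replacement character U+FFFD.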

-- Strict ByteString
BSU.toString   :: BS.ByteString -> String
BSU.fromString :: String -> BS.ByteString
-- Lazy ByteString
BLU.toString   :: BL.ByteString -> String
BLU.fromString :: String -> BL.ByteString
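
A quick round trip, sketched under the assumption that the utf8-string package is installed. The sample string and binding names are invented for the example.

```haskell
import qualified Data.ByteString      as BS
import qualified Data.ByteString.UTF8 as BSU

original :: String
original = "año λ"

-- The UTF-8 encoding of the String above.
bytes :: BS.ByteString
bytes = BSU.fromString original

main :: IO ()
main = do
  print (length original, BS.length bytes)   -- (5,7): characters vs bytes
  print (BSU.toString bytes == original)     -- True: the round trip is lossless
  -- A byte that is not valid UTF-8 decodes to the replacement character.
  print (BSU.toString (BS.pack [0xFF]) == "\xFFFD")
```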

`String` <-> `Text`

No encoding is involved here: both types hold Unicode characters, so these conversions will always be correct.

-- Strict Text
TS.unpack :: TS.Text -> String
TS.pack   :: String -> TS.Text
-- Lazy Text
TL.unpack :: TL.Text -> String
TL.pack   :: String -> TL.Text
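
Where this conversion usually shows up is at the border with String-based APIs such as `show` and `read`. The helpers below, `tshow` and `readText`, are hypothetical names for a sketch of that pattern.

```haskell
import qualified Data.Text as TS
import Text.Read (readMaybe)

-- Render any showable value as Text by going through String.
tshow :: Show a => a -> TS.Text
tshow = TS.pack . show

-- Parse a Text value with a String-based parser.
readText :: Read a => TS.Text -> Maybe a
readText = readMaybe . TS.unpack

main :: IO ()
main = do
  print (tshow (42 :: Int))
  print (readText (TS.pack "42") :: Maybe Int)     -- Just 42
  print (readText (TS.pack "nope") :: Maybe Int)   -- Nothing
```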

`ByteString` <-> `Text`

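The Text holds Unicode characters, and the ByteString is assumed to hold their UTF-8 encoding. If that assumption is wrong, decodeUtf8 throws an exception. Both encoding modules also provide decodeUtf8', which returns an Either instead.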

-- Both are strict
TSE.encodeUtf8 :: TS.Text -> BS.ByteString
TSE.decodeUtf8 :: BS.ByteString -> TS.Text
-- Both are lazy
TLE.encodeUtf8 :: TL.Text -> BL.ByteString
TLE.decodeUtf8 :: BL.ByteString -> TL.Text
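
When the bytes come from the outside world, it is probably safer to decode with `decodeUtf8'` and handle the failure explicitly. The wrapper `decodeSafely` is a made-up name for this sketch, and the byte values are just one example of an invalid sequence.

```haskell
import qualified Data.ByteString    as BS
import qualified Data.Text          as TS
import qualified Data.Text.Encoding as TSE

-- Decode UTF-8 without throwing: invalid input becomes a Left with a message.
decodeSafely :: BS.ByteString -> Either String TS.Text
decodeSafely bytes =
  case TSE.decodeUtf8' bytes of
    Left err   -> Left (show err)
    Right text -> Right text

main :: IO ()
main = do
  print (decodeSafely (TSE.encodeUtf8 (TS.pack "ok")))   -- Right "ok"
  -- 0xC3 must be followed by a continuation byte, and 0x28 is not one.
  putStrLn (either (const "invalid UTF-8") TS.unpack (decodeSafely (BS.pack [0xC3, 0x28])))
```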

Lazy <-> Strict

-- Text
TL.fromStrict :: TS.Text -> TL.Text
TL.toStrict   :: TL.Text -> TS.Text
-- ByteString
BL.fromStrict :: BS.ByteString -> BL.ByteString
BL.toStrict   :: BL.ByteString -> BS.ByteString
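
The sketch below pokes at the chunk structure to show what these conversions do. `BL.fromChunks` and `BL.toChunks` are part of the bytestring API; the `chunked` value is invented for the example. Keep in mind that `toStrict` has to copy every chunk into one contiguous buffer, so the whole value must fit in memory.

```haskell
import qualified Data.ByteString      as BS
import qualified Data.ByteString.Lazy as BL

-- A lazy ByteString built from two strict chunks.
chunked :: BL.ByteString
chunked = BL.fromChunks [BS.pack [1, 2], BS.pack [3, 4, 5]]

main :: IO ()
main = do
  print (length (BL.toChunks chunked))      -- 2 chunks
  print (BS.unpack (BL.toStrict chunked))   -- [1,2,3,4,5] in one strict value
  -- Going back to lazy wraps the strict value in a single chunk.
  print (length (BL.toChunks (BL.fromStrict (BL.toStrict chunked))))
```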

GHCi and printing Unicode

When trying things out with ghci (or stack repl) it is often necessary to print Unicode strings, for String, Text, or properly encoded ByteString values.

The Standard Prelude provides the print function that, when invoked from within the REPL, turns out to be less than satisfying: it goes through show, which escapes every non-ASCII character.

ghci> print "®øñá"
"\174\248\241\225"
it :: ()

However, the unicode-show package provides the Unicode-aware uprint function that will do the trick:

ghci> :m Text.Show.Unicode
ghci> uprint "®øñá"
"®øñá"
it :: ()

If you want to use uprint automatically instead of print when working at the REPL, install the unicode-show package in your OS or global stack project, and add the following to the appropriate .ghci file:

import qualified Text.Show.Unicode
:set -interactive-print=Text.Show.Unicode.uprint
