People who are new to Haskell often wonder what's going on with «string» types. Especially those who still get by using strings and half-baked regex-based parsers, because they don't know better yet. As I often say, «String is the poor man's data structure». But I digress...
This cheat sheet helps you figure out what the usual string types are, what they are useful for, and how to go back and forth among them.
The standard String type
It comes from the original Haskell design, and it is
included in the Standard Prelude. There's nothing special
about it: it's just a list of characters, i.e.
type String = [Char]
This simplicity is very important for pedagogical
reasons: it helps in learning how parametric polymorphism
is way more powerful than subtype specialization.
Any function that works on lists works on String, and
that helps get the point across.
However, this simplicity comes with limitations. For
starters, it becomes very hard (but not impossible)
to use String to represent anything beyond ASCII and
printable Unicode. Worst of all, being simple linked
lists of boxed Char makes them very inefficient for
large-scale string processing, both in time and space.
Don't use them for that.
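To make the «just a list» point concrete, here is a minimal sketch using nothing but the Prelude and Data.Char. The names greeting, shout and countVowels are made up for illustration.

```haskell
import Data.Char (toUpper)

-- A String literal and an explicit list of Char are the same value.
greeting :: String
greeting = "hola"

sameAsList :: Bool
sameAsList = greeting == ['h', 'o', 'l', 'a']

-- Ordinary list functions work unchanged on String.
shout :: String -> String
shout = map toUpper

countVowels :: String -> Int
countVowels = length . filter (`elem` "aeiou")
```

Nothing here is String-specific: map, filter, length and elem are the same functions you would use on a list of numbers.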
The «human-readable» Text
A string data type that is optimized for efficiently
handling «human readable» Char sequences. It provides
an API offering the same functionality as String and
regular list manipulation functions.
In order to use it, you'll need the text package, and to
import the module to get access to the API. It's 100%
Unicode compatible and it's almost always a better
choice than plain String. You get two choices:
«Strict» Text (Data.Text). Uses tightly packed vectors of Unicode characters. Each value must fit in memory in order to be used.
«Lazy» Text (Data.Text.Lazy). Uses a list of «Strict» Text chunks. It allows larger than RAM lazy text processing.
Depending on the library version, both variants may be subject to fusion: if you use the API and higher-order functions (folds, maps, and scans), the compiler can combine operations to minimize or outright eliminate intermediate structures.
This is the data type you want to read/write to text files, regular database columns, and pretty much anything that will be read or written by a human.
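As a rough sketch of what the API feels like, the following uses only functions from the text package. The sample values and the names title, cleaned, charCount, chunked and chunkCount are hypothetical, and pack is used so that no language extension is needed.

```haskell
import qualified Data.Text as TS
import qualified Data.Text.Lazy as TL

-- Hypothetical sample value with surrounding whitespace.
title :: TS.Text
title = TS.pack "  señor λ  "

-- The list-like API works on whole Unicode characters.
cleaned :: TS.Text
cleaned = TS.toUpper (TS.strip title)

-- length counts characters, not bytes.
charCount :: Int
charCount = TS.length (TS.strip title)

-- A lazy Text is assembled from strict chunks.
chunked :: TL.Text
chunked = TL.fromChunks [TS.pack "hola, ", TS.pack "mundo"]

chunkCount :: Int
chunkCount = length (TL.toChunks chunked)
```

The lazy value behaves like a single text even though it is stored as two strict pieces.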
The «machine readable» ByteString
A string data type that is optimized for efficiently
handling «machine readable» Byte sequences. It provides
an API offering the same functionality as String and
regular list manipulation functions, with efficiency
similar to that of Text.
In order to use it, you'll need the bytestring package,
and to import the module to get access to the API. There's
no implicit encoding, and it's the correct choice when
serializing binary data or dealing with interoperable
protocols. You get two choices:
«Strict» ByteString (Data.ByteString). Uses tightly packed vectors of bytes (not characters). It's the programmer's responsibility to figure out whatever encoding is necessary to interpret them.
«Lazy» ByteString (Data.ByteString.Lazy). Uses a list of «Strict» ByteString chunks. It allows larger than RAM lazy byte sequence processing.
Depending on the library version, both variants may be subject to fusion: if you use the API and higher-order functions (folds, maps, and scans), the compiler can combine operations to minimize or outright eliminate intermediate structures.
This is the data type you want to read/write to binary files, deal with binary protocols and low-level data structures (JSON and XML), and pretty much anything that will be read or written by another program.
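To illustrate the «bytes, not characters» point, here is a small sketch that picks apart a made-up binary header. The 4-byte layout (a 0xCAFE magic number followed by a 16-bit big-endian length) and all the names are assumptions for the example, not any real protocol.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

-- Hypothetical 4-byte header: magic number 0xCAFE, then a 16-bit
-- big-endian payload length. The layout is made up for illustration.
header :: BS.ByteString
header = BS.pack [0xCA, 0xFE, 0x00, 0x2A]

-- Elements are Word8 values, so slicing yields raw bytes, not characters.
hasMagic :: Bool
hasMagic = BS.take 2 header == BS.pack [0xCA, 0xFE]

-- Combine the two length bytes by hand.
payloadLength :: Int
payloadLength =
  fromIntegral (BS.index header 2) * 256 + fromIntegral (BS.index header 3)

-- A lazy ByteString is assembled from strict chunks.
framed :: BL.ByteString
framed = BL.fromChunks [header, BS.pack [1, 2, 3]]
```

No encoding is consulted anywhere: what the bytes mean is entirely up to the code reading them.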
Usage and going back and forth
For the remainder of this cheat sheet, the following
imports are assumed alongside the implicit Prelude with
its String provision. Note that these naming conventions
are not mandatory, yet kind of traditional. The imports
are qualified so that their names don't clash with the
Prelude's list functions.
Package utf8-string is needed for the UTF-8 conversions.
import qualified Data.ByteString as BS
import qualified Data.ByteString.UTF8 as BSU
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.UTF8 as BLU
import qualified Data.Text as TS
import qualified Data.Text.Encoding as TSE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
You should always use the OverloadedStrings GHC extension.
It lets a string literal take whatever string type the
context requires, checked at compile time. Also, it
ensures that Text literals are always correctly encoded.
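A minimal sketch of the extension at work: the same literal syntax is used for four different types, and the type signature alone picks the conversion. The names and sample values are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.Text as TS
import qualified Data.Text.Lazy as TL

-- Plain String literals keep working as before.
asString :: String
asString = "plain"

-- The same syntax produces strict and lazy Text, Unicode included.
asText :: TS.Text
asText = "añejo λ"

asLazyText :: TL.Text
asLazyText = "añejo λ"

-- For ByteString, stick to ASCII in the literal.
asBytes :: BS.ByteString
asBytes = "GET / HTTP/1.1"
```

Without the extension, each of the last three definitions would be a type error.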
You should never ever use Unicode inside a literal
ByteString. It will not work. It's not meant to be used
that way. No, the «language» is not wrong, YOU'RE wrong.
Just to be clear, the following will always be wrong:
wrongLitLBS :: BL.ByteString
wrongLitLBS = "these á ñ λ will be badly© encoded"
If you insist on having a literal ByteString with embedded
Unicode (a thing that doesn't make sense, but alas), then
you'll need:
welpLitLBS :: BL.ByteString
welpLitLBS = BLU.fromString "these á ñ λ will be safely© encoded"
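If you wonder what «badly encoded» looks like in practice, here is a small sketch. It uses Data.ByteString.Char8.pack, which is documented to keep only the low 8 bits of each character; to my understanding the literal conversion misbehaves in the same way, but treat that as an assumption. The names truncated and utf8Bytes are made up.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as TS
import qualified Data.Text.Encoding as TSE
import Data.Word (Word8)

-- Char8-style packing keeps only the low 8 bits of each code point,
-- so λ (U+03BB) collapses into the single byte 0xBB.
truncated :: [Word8]
truncated = BS.unpack (BC.pack "λ")

-- A real UTF-8 encoding of the same character takes two bytes.
utf8Bytes :: [Word8]
utf8Bytes = BS.unpack (TSE.encodeUtf8 (TS.pack "λ"))
```

The first value has silently lost information; the second is what a UTF-8 aware consumer expects.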
The conversions
Some libraries use String, some libraries use Text, some
libraries use ByteString, some use all of them. You'll have
a lazy something, but need a strict one, or vice versa.
For instance, you might have a domain name as a Text value,
but in order to perform a DNS query, you need to build a
DNS message that must be a ByteString. Or you might
want to parse an XML/JSON payload, which comes in as
a ByteString, and then turn it into some Text fields
within Haskell data types.
That is, you'll need to convert among types and representations. Hence, this cheat-sheet:
String <-> ByteString
UTF-8 encoding is explicit on String and assumed for the
ByteString. If the latter is wrong, these won't work.
-- Strict ByteString
BSU.toString :: BS.ByteString -> String
BSU.fromString :: String -> BS.ByteString
-- Lazy ByteString
BLU.toString :: BL.ByteString -> String
BLU.fromString :: String -> BL.ByteString
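If you'd rather not depend on utf8-string, the same conversion can be sketched by going through Text. This is a swapped-in alternative, not what the functions above do internally, and the helper names stringToUtf8 and utf8ToString are hypothetical. It relies on decodeUtf8' from the text package, which reports invalid input instead of throwing.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as TS
import qualified Data.Text.Encoding as TSE

-- Hypothetical helper: encode a String as UTF-8 bytes via Text.
stringToUtf8 :: String -> BS.ByteString
stringToUtf8 = TSE.encodeUtf8 . TS.pack

-- Hypothetical helper: decode UTF-8 bytes, reporting invalid input
-- as a Left instead of throwing an exception.
utf8ToString :: BS.ByteString -> Either String String
utf8ToString bytes =
  case TSE.decodeUtf8' bytes of
    Left err   -> Left (show err)
    Right text -> Right (TS.unpack text)
```

The Either in the return type makes the «assumed for the ByteString» caveat visible to the caller.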
String <-> Text
Both sides hold Unicode characters, so no byte encoding is involved and these always work.
-- Strict Text
TS.unpack :: TS.Text -> String
TS.pack :: String -> TS.Text
-- Lazy Text
TL.unpack :: TL.Text -> String
TL.pack :: String -> TL.Text
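A typical use of this pair is to borrow the Text API inside code that is stuck with String at its edges. A minimal sketch, with hypothetical helper names roundTrip and normalize:

```haskell
import qualified Data.Text as TS
import qualified Data.Text.Lazy as TL

-- Converting there and back leaves ordinary text untouched.
roundTrip :: String -> String
roundTrip = TS.unpack . TS.pack

-- Hypothetical helper: clean up user input with the Text API while
-- keeping a String interface for the caller.
normalize :: String -> String
normalize = TS.unpack . TS.toLower . TS.strip . TS.pack

-- The lazy variant works the same way.
roundTripLazy :: String -> String
roundTripLazy = TL.unpack . TL.pack
```

Each conversion walks the whole value, so it's better to convert once at the boundary than back and forth in a loop.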
ByteString <-> Text
UTF-8 encoding is explicit on Text and assumed for the
ByteString. If the latter is wrong, these won't work.
-- Both are strict
TSE.encodeUtf8 :: TS.Text -> BS.ByteString
TSE.decodeUtf8 :: BS.ByteString -> TS.Text
-- Both are lazy
TLE.encodeUtf8 :: TL.Text -> BL.ByteString
TLE.decodeUtf8 :: BL.ByteString -> TL.Text
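When the bytes come from outside and the UTF-8 assumption may not hold, the text package also offers decoders that don't throw. The sketch below uses decodeUtf8' and decodeUtf8With with lenientDecode; the sample domain value and the helper names are hypothetical, and a real DNS message needs more than plain encoding.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as TS
import qualified Data.Text.Encoding as TSE
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical sample value.
domain :: TS.Text
domain = TS.pack "example.com"

-- Encoding never fails: every Text has a UTF-8 representation.
wireBytes :: BS.ByteString
wireBytes = TSE.encodeUtf8 domain

-- Hypothetical helper: Nothing when the bytes are not valid UTF-8.
decodeStrictly :: BS.ByteString -> Maybe TS.Text
decodeStrictly = either (const Nothing) Just . TSE.decodeUtf8'

-- Hypothetical helper: invalid bytes become U+FFFD instead of failing.
decodeLeniently :: BS.ByteString -> TS.Text
decodeLeniently = TSE.decodeUtf8With lenientDecode
```

Which one to pick depends on whether garbage input should stop processing or merely show up as replacement characters.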
Lazy <-> Strict
-- Text
TL.fromStrict :: TS.Text -> TL.Text
TL.toStrict :: TL.Text -> TS.Text
-- ByteString
BL.fromStrict :: BS.ByteString -> BL.ByteString
BL.toStrict :: BL.ByteString -> BS.ByteString
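A short sketch of what these conversions do to the chunk structure, using only functions from the bytestring and text packages. The sample values and names are made up.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as TS
import qualified Data.Text.Lazy as TL

-- A lazy ByteString made of two strict chunks.
lazyBytes :: BL.ByteString
lazyBytes = BL.fromChunks [BS.pack [1, 2], BS.pack [3, 4, 5]]

-- toStrict copies every chunk into one contiguous value, so the
-- whole thing must fit in memory.
strictBytes :: BS.ByteString
strictBytes = BL.toStrict lazyBytes

-- fromStrict wraps the strict value as a single chunk.
rewrapped :: BL.ByteString
rewrapped = BL.fromStrict strictBytes

-- Text follows the same pattern.
lazyText :: TL.Text
lazyText = TL.fromStrict (TS.pack "hola")
```

Going lazy to strict is the expensive direction, so avoid it on values that were lazy precisely because they are large.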
GHCi and printing Unicode
When trying things out with ghci (or stack repl) it is
often necessary to print Unicode strings, for String,
Text, or properly encoded ByteString values.
The Standard Prelude provides the print function
that, when invoked from within the REPL, turns out to
be less than satisfying.
ghci> print "®øñá"
"\174\248\241\225"
it :: ()
However, package unicode-show provides the Unicode-aware
uprint function that will do the trick:
ghci> :m Text.Show.Unicode
ghci> uprint "®øñá"
"®øñá"
it :: ()
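The escapes in the print output come from show, since print is just putStrLn applied to the result of show. A Prelude-only sketch of the difference, where the names sample, escaped and printRaw are hypothetical:

```haskell
-- Sample value, the same one used at the prompt above.
sample :: String
sample = "®øñá"

-- show quotes the string and escapes every non-ASCII character.
escaped :: String
escaped = show sample

-- Hypothetical helper: skip show and write the characters as they are.
-- What actually appears depends on the terminal's encoding.
printRaw :: String -> IO ()
printRaw = putStrLn
```

This only helps for a bare String; for values nested inside other data structures, something like uprint is still the practical choice.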
If you want to use uprint automatically instead of print
when working at the REPL, install package unicode-show
to your OS or global stack project, and add the following
to the appropriate .ghci file:
import qualified Text.Show.Unicode
:set -interactive-print=Text.Show.Unicode.uprint