UTF-8 strings with Go: len(s) isn't enough
Monday, 7 March 2022.
In this post, I show you the bare minimum you need to know to do UTF-8 string manipulation safely in Go.
Update 09/03/2022: Someone on Reddit pointed out that counting runes isn’t enough to slice strings correctly, given that Unicode has multi-codepoint glyphs, such as flags. I’ve updated the post to reflect that but couldn’t find a concise and straightforward solution.
Read also: Back to basics: Writing an application using Go and PostgreSQL and Homelab: Intel NUC with the ESXi hypervisor.
tl;dr
Use the unicode/utf8 package to:
- Validate that a string isn’t in another encoding or corrupted:
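For instance (the sample strings here are illustrative):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	fmt.Println(utf8.ValidString("Olá, mundo!"))     // true
	fmt.Println(utf8.ValidString("Olá, mundo!"[:3])) // false: the slice cuts "á" in half
}
```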
- Get the right number of runes in a UTF-8 string:
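For example (using a Portuguese word as an illustration):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "ação"
	fmt.Println(len(s))                    // 6: the length in bytes
	fmt.Println(utf8.RuneCountInString(s)) // 4: the number of runes
}
```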
But here is a surprise:
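A sketch of the surprise, using the Liechtenstein flag that appears later in this post:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "🇱🇮" // one visible glyph, but two code points (regional indicators L and I)
	fmt.Println(utf8.RuneCountInString(s)) // 2
	fmt.Println(len(s))                    // 8 bytes
}
```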
- Strings might get corrupted if you try to slice them directly by byte index (their binary length):
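For instance (slicing at an arbitrary byte index):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "🇱🇮 is the flag for Liechtenstein"
	cut := s[:6] // a byte index that lands in the middle of a rune
	fmt.Println(cut)                   // prints mojibake instead of the flag
	fmt.Println(utf8.ValidString(cut)) // false
}
```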
To slice them correctly at rune boundaries, you might think to use utf8.DecodeRune or utf8.DecodeRuneInString to get the first rune and its size:
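Something along these lines:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "🇱🇮 is the flag for Liechtenstein"
	r, size := utf8.DecodeRuneInString(s)
	fmt.Printf("%c %d\n", r, size) // the first rune and its width in bytes: 🇱 4
}
```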
Then, this:
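A sketch of what that might look like (the sample sentence is illustrative):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "🇱🇮 is the flag for Liechtenstein"
	_, size := utf8.DecodeRuneInString(s)

	// Drop the first rune and keep the rest of the string:
	fmt.Println(s[size:]) // "🇮 is the flag for Liechtenstein"
}
```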
This is not what we wanted ("🇮 is the flag for Liechtenstein"), but it’s still valid UTF-8.
Also, this is not a false positive: the leading rune is valid. Confusing, right?
It turns out Unicode text segmentation is harder than I expected, as some glyphs use multiple code points (runes).
The package github.com/rivo/uniseg provides a limited API that can help you with that:
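A minimal sketch of using it (assuming its NewGraphemes iterator and GraphemeClusterCount helper):

```go
package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	s := "🇱🇮 is the flag for Liechtenstein"

	// The flag is a single grapheme cluster (glyph), even though it is two runes:
	fmt.Println(uniseg.GraphemeClusterCount("🇱🇮")) // 1

	// Walk the string glyph by glyph:
	g := uniseg.NewGraphemes(s)
	for g.Next() {
		fmt.Printf("%q ", g.Str())
	}
	fmt.Println()
}
```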
So, if you need to cut a string at a place that is not a clear " " (whitespace) or some other symbol you can precisely define, you might want to walk through the glyphs one by one to do it safely.
Background
In the early days of the web, websites from different regions used different character encodings, according to their region. Nowadays, most websites use the Unicode encoding known as UTF-8. Unicode defines 144,697 characters. Here are some of the most popular encodings:
| Encoding | Use |
| --- | --- |
| UTF-8 | Unicode Standard (International) |
| ISO-8859-1 | Western European languages (includes English) |
| ISO-8859-2 | Eastern European languages |
| ISO-8859-5 | Cyrillic languages |
| GB 2312 | Simplified Chinese |
| Shift JIS | Japanese |
| Windows-125x series | Windows code pages: character sets for multiple languages |
| … | … |
UTF-8 was created in 1992 by Ken Thompson and Rob Pike as a variable-width character encoding, originally implemented for the Plan 9 operating system. It is backward-compatible with ASCII. As of 2022, more than 97% of the content on the web is encoded with UTF-8. See:
- Unicode over 60 percent of the web (Google, February 3, 2012)
- Historical yearly trends in the usage statistics of character encodings for websites (since 2011)
Why should you care?
Take a language with just a few extra glyphs, like Portuguese or Spanish, and you’ll quickly notice the importance of handling encodings properly when writing software. To show that, let me write a small program that iterates over a string rune by rune, assuming it is UTF-8, and prints each rune’s representation:
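Something along these lines (the contents of examples are my own illustrative mix of Portuguese, Greek, Japanese, and a flag):

```go
package main

import "fmt"

func main() {
	const examples = "não σπίτι 日本語 🇱🇮"

	// Ranging over a string decodes it as UTF-8, yielding each rune
	// and the byte offset where it starts.
	for offset, r := range examples {
		fmt.Printf("byte %2d:\t%q\t%U\n", offset, r, r)
	}
}
```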
The last two code points in the output are the regional indicator symbols that, together, render as 🇱🇮.
Each of the characters in the examples variable above is represented by one or more of what character-encoding terminology calls a code point: a numeric value that computers use to map, transmit, and store characters.
Now, UTF-8 is a variable-width encoding requiring one to four bytes (that is, 8, 16, 24, or 32 bits) to represent a single code point.
UTF-8 uses one byte for the first 128 code points (backward-compatible with ASCII), and up to 4 bytes for the rest.
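For instance (characters picked to show each width):

```go
package main

import "fmt"

func main() {
	// len reports the size in bytes, not in characters.
	fmt.Println(len("a"))  // 1 byte (ASCII)
	fmt.Println(len("é"))  // 2 bytes
	fmt.Println(len("中")) // 3 bytes
	fmt.Println(len("🙂")) // 4 bytes
}
```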
While UTF-8 and many other encodings, such as ISO-8859-1, are backward-compatible with ASCII, their extended code spaces aren’t compatible with one another.
In the Go world, a code point is called a rune.
From Go’s src/builtin/builtin.go definition of rune, we can see it uses an int32 internally for each code point:
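It boils down to a type alias:

```go
// rune is an alias for int32 and is used, by convention,
// to distinguish character values from integer values.
type rune = int32
```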
How can this affect you, anyway?
In the early days of the web (though this problem can still happen elsewhere), whenever you tried to access content from another region that your computer vendor hadn’t prepared it to handle, you’d likely get a long series of □ or � replacement characters in your browser.
If you wanted to display, say, Japanese or Cyrillic correctly, not only would you have to download a new font: there was also a high chance of the website not setting the encoding correctly, forcing you to adjust it manually in your browser (and hope it worked).
unicode/utf8
With Unicode and UTF-8, this became a problem of the past. Sure, I still cannot read any non-Latin language, but at least everyone’s computers now render beautiful Japanese or Chinese calligraphy just fine out of the box.
From a software development perspective, we need to be aware of several problems, such as how to handle string manipulation correctly, as we don’t want to cause data corruption.
For that, when working with UTF-8 in Go, if you need to do any sort of string manipulation, such as truncating a long string, you’ll want to use the unicode/utf8 package.
Length of a string vs. the number of runes
What is the length of the following words?
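For example (the words below are my own picks):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	words := []string{"ação", "über", "日本語", "🇱🇮"}
	for _, w := range words {
		fmt.Printf("%s\tlen=%d\trunes=%d\n", w, len(w), utf8.RuneCountInString(w))
	}
}
```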
As you can see, neither len(s) nor the number of runes can be used to count the number of glyphs properly.
Using github.com/rivo/uniseg, you can iterate over graphemes like this:
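A sketch (the sample string combines a combining accent and a flag, both multi-rune glyphs):

```go
package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	// "é" written as e + combining acute accent, plus a two-rune flag.
	g := uniseg.NewGraphemes("cafe\u0301 🇱🇮")
	for g.Next() {
		fmt.Printf("%q -> runes %U\n", g.Str(), g.Runes())
	}
}
```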
You can use the following functions to get the number of runes (not glyphs):
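From unicode/utf8:

```go
func RuneCount(p []byte) int
func RuneCountInString(s string) (n int)
```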
To get the exact number of glyphs, you might want to try out github.com/rivo/uniseg:
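A minimal sketch with its GraphemeClusterCount helper:

```go
package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	// One glyph, even though it spans two runes and eight bytes.
	fmt.Println(uniseg.GraphemeClusterCount("🇱🇮")) // 1
}
```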
To validate whether a string consists entirely of valid UTF-8-encoded runes, use the following functions:
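All from unicode/utf8:

```go
func Valid(p []byte) bool
func ValidRune(r rune) bool
func ValidString(s string) bool
```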
Example:
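Something like this (the malformed byte sequence is made up):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	valid := "Olá, mundo!"
	invalid := string([]byte{0xf0, 0x28, 0x8c, 0x28}) // a malformed byte sequence

	fmt.Println(utf8.ValidString(valid))   // true
	fmt.Println(utf8.ValidString(invalid)) // false
}
```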
Converting between encodings
To convert to/from other encodings, you might use the golang.org/x/text/encoding/charmap package.
Example:
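A sketch round-tripping a Portuguese word through ISO-8859-1 (Latin-1):

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// Encode a UTF-8 string as Latin-1 bytes...
	latin1, err := charmap.ISO8859_1.NewEncoder().String("ação")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(latin1)) // 4: one byte per character in Latin-1

	// ...and decode it back to UTF-8.
	roundTrip, err := charmap.ISO8859_1.NewDecoder().String(latin1)
	if err != nil {
		panic(err)
	}
	fmt.Println(roundTrip) // ação
}
```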
If you need help reading a malformed string, or getting runes individually, read the documentation for the unicode/utf8 package and check out its examples.
References
- Strings, bytes, runes and characters in Go by Rob Pike.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.