In this post, I show you the bare minimum you need to know to manipulate UTF-8 strings safely in Go.
Update 09/03/2022: Someone on Reddit pointed out that counting runes isn’t enough to slice strings correctly, since Unicode has multi-codepoint glyphs, such as flags.
I’ve updated the post to reflect that but couldn’t find a concise and straightforward solution.
Validate that a string isn’t in another encoding or corrupted:
Get the right number of runes in a UTF-8 string:
fmt.Println(utf8.RuneCountInString("é um cãozinho")) // returns 13 as expected, but there's a gotcha (keep reading)
fmt.Println(len("é um cãozinho")) // returns 15 because 'é' and 'ã' are represented by two bytes each
To slice them in runes correctly, you might think to use utf8.DecodeRune or utf8.DecodeRuneInString to get the first rune and its size:
var dog = "é um cãozinho"
_, offset := utf8.DecodeRuneInString(dog)
dog = dog[offset:]
fmt.Printf("got: %q (valid: %v)\n", dog, utf8.ValidString(dog))
// got: " um cãozinho" (valid: true)
var broken = "🇱🇮 is the flag for Liechtenstein"
_, offset := utf8.DecodeRuneInString(broken)
broken = broken[offset:]
fmt.Printf("got: %q (valid: %v)\n", broken, utf8.ValidString(broken))
// got: "🇮 is the flag for Liechtenstein" (valid: true)
This is not what we wanted (" is the flag for Liechtenstein"), but it’s still valid UTF-8.
Also, this is not a false positive: the leading rune is valid. Confusing, right?
var s = "🇱🇮: Liechtenstein"
fmt.Printf("glyphs=%d runes=%d len(s)=%d\n", uniseg.GraphemeClusterCount(s), utf8.RuneCountInString(s), len(s))
gr := uniseg.NewGraphemes(s)
gr.Next() // advance to the first grapheme cluster
from, to := gr.Positions()
fmt.Printf("First glyph runes: %x (bytes positions: %d-%d)\n", gr.Runes(), from, to)
fmt.Printf("slicing after first glyph: %q", s[to:])
// glyphs=16 runes=17 len(s)=23
// First glyph runes: [1f1f1 1f1ee] (bytes positions: 0-8)
// slicing after first glyph: ": Liechtenstein"
So, if you need to cut a string in a place that is not a clear " " (whitespace) or other symbols you can precisely define, you might want to walk through the glyphs one by one to do it safely.
In the early days of the web, websites from different regions used different character encodings, according to their audience. Nowadays, most websites use the Unicode encoding known as UTF-8. Unicode defines 144,697 characters.
Here are some of the most popular encodings:
Windows code pages: character sets for multiple languages
UTF-8 was created in 1992 by Ken Thompson and Rob Pike as a variable-width character encoding, originally implemented for the Plan 9 operating system. It is backward-compatible with ASCII. As of 2022, more than 97% of the content on the web is encoded with UTF-8.
Take a language with just a few extra glyphs like Portuguese or Spanish, and you’ll quickly notice the importance of handling encodings properly when writing software.
To show that, let me write a small program that will iterate over a string rune by rune assuming it is UTF-8 and print its representation:
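A minimal version of such a program might look like this, using Go’s `range` loop, which decodes a string rune by rune (the word `cão` is an illustrative example; note how the byte offset jumps from 1 to 3 because 'ã' takes two bytes):

```go
package main

import "fmt"

func main() {
	// range over a string yields the byte offset and the decoded rune.
	for i, r := range "cão" {
		fmt.Printf("byte offset %d: %c (U+%04X)\n", i, r, r)
	}
}
```

Output:

```go
// byte offset 0: c (U+0063)
// byte offset 1: ã (U+00E3)
// byte offset 3: o (U+006F)
```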
Each of the characters in the preceding examples is represented by one or more code points. In character encoding terminology, a code point is a numeric value that computers use to map, transmit, and store characters.
Now, UTF-8 is a variable-width encoding requiring one to four bytes (that is, 8, 16, 24, or 32 bits) to represent a single code point.
UTF-8 uses one byte for the first 128 code points (backward-compatible with ASCII), and up to 4 bytes for the rest.
While UTF-8 and many other encodings, such as ISO-8859-1, are backward-compatible with ASCII, their extended codespaces aren’t compatible with each other.
In the Go world, a code point is called a rune.
From Go’s src/builtin/builtin.go definition of rune, we can see it uses an int32 internally for each code point:
// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32
How can this affect you, anyway?
In the early days of the web (though this problem can still happen elsewhere), whenever you tried to access content from a region your computer vendor hadn’t prepared it to handle, you’d likely see a long series of □ or � replacement characters in your browser.
If you wanted to display, say, Japanese or Cyrillic correctly, you’d not only have to download a new font: there was also a high chance of the website not declaring its encoding correctly, forcing you to adjust it manually in your browser (and hope it worked).
With Unicode and UTF-8, this became a problem of the past.
Surely, I still cannot read any non-Latin language, but at least everyone’s computers now render beautiful Japanese or Chinese calligraphy just fine out of the box.
From a software development perspective, we need to be aware of several problems, such as how to handle string manipulation correctly, as we don’t want to cause data corruption.
For that, when working with UTF-8 in Go, if you need to do any sort of string manipulation, such as truncating a long string, you’ll want to use the unicode/utf8 package.
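A minimal sketch of such a truncation helper (the name `truncate` is my own, not from the standard library): it uses `utf8.RuneStart` to back up to a rune boundary so no rune gets cut in half, though, as the flag example shows, it can still split a multi-rune glyph:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncate cuts s to at most max bytes without splitting a rune in half.
// Caveat: it can still split a multi-rune glyph such as a flag.
func truncate(s string, max int) string {
	if len(s) <= max {
		return s
	}
	// Back up until s[max] is the first byte of a rune.
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

func main() {
	fmt.Println(truncate("cãozinho", 3)) // "cã": byte 3 starts the rune 'o'
	fmt.Println(truncate("cãozinho", 2)) // "c": cutting at byte 2 would split 'ã'
}
```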
Length of a string vs. the number of runes:
What is the length of the following words?