Understanding String Data Type in Go
Strings in Go deserve special attention because they are implemented very differently in Go compared to other languages.
Let’s write a simple program. To declare an empty string variable, use the string type; its zero value is the empty string "".
https://play.golang.org/p/vMDoeaV3RCY
To find the length of a string, you can use the len function. len is built into the language, so you don’t need to import it from any package.
len is not particular to strings: it also reports the length of arrays, slices, maps and channels. We will learn more about Go’s built-in functions in upcoming tutorials.
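As a small sketch of the points above, the program below declares an empty string and calls len on a string, a slice and a map. The helper name lengths and the sample values are made up for illustration only.

```go
package main

import "fmt"

// lengths is a hypothetical helper: it calls the built-in len on an empty
// string, a slice and a map to show that len is not limited to strings.
func lengths() (int, int, int) {
	var empty string // the zero value of a string is ""
	nums := []int{10, 20, 30}
	ages := map[string]int{"ann": 31, "bob": 27}
	return len(empty), len(nums), len(ages)
}

func main() {
	s := "Hello World"
	fmt.Println(len(s)) // 11

	a, b, c := lengths()
	fmt.Println(a, b, c) // 0 3 2
}
```

Note that no import is needed for len itself; fmt is imported only for printing.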
https://play.golang.org/p/Kqj-TJMFyXP
This prints 11 to the console, because the string s holds 11 characters, including the space, which is also a character. All characters in Hello World are valid ASCII characters, so we expect each one to occupy a single byte in memory (an ASCII character encoded in UTF-8 occupies 8 bits, or 1 byte). Let’s verify that by using a for loop on the string.
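A minimal sketch of such a loop might look like the following. The helper byteValues is a name chosen here for illustration; it simply collects s[i] for every index.

```go
package main

import "fmt"

// byteValues returns the value of every byte in s, in order.
func byteValues(s string) []uint8 {
	values := make([]uint8, 0, len(s))
	for i := 0; i < len(s); i++ {
		values = append(values, s[i]) // s[i] is a single byte (uint8)
	}
	return values
}

func main() {
	fmt.Println(byteValues("Hello World"))
	// [72 101 108 108 111 32 87 111 114 108 100]
}
```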
In Go, a string is in effect a read-only slice of bytes. For now, think of a slice as a simple array; we will learn about slices in upcoming lessons. So in the above case, we are seeing the byte (uint8) values of the string s, which is internally a slice of bytes: s[i] gives the decimal value of the byte at index i. To see individual characters, you can use the %c format verb in a Printf statement. You can also use the %v verb to see the byte value and %T to see the data type of the value.
https://play.golang.org/p/wwqhgHcTeIU
As you can see, each letter corresponds to a decimal number of type uint8, which occupies 8 bits, or 1 byte, of memory.
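The three format verbs mentioned above can be tried together in a short sketch. The helper describeByte is hypothetical and exists only to keep the formatting in one place.

```go
package main

import "fmt"

// describeByte formats the byte at index i of s as "character value type".
func describeByte(s string, i int) string {
	return fmt.Sprintf("%c %v %T", s[i], s[i], s[i])
}

func main() {
	s := "Hello World"
	for i := 0; i < len(s); i++ {
		fmt.Println(describeByte(s, i))
	}
	// The first line printed is: H 72 uint8
}
```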
As we know (see the Wikipedia page on UTF-8), a UTF-8 character takes from 1 byte (ASCII compatible) to 4 bytes of memory. Hence in Go, a single character is represented by the int32 data type (4 bytes in size), which is large enough to hold any of them. A code unit is the number of bits an encoding uses for one single unit cell. UTF-8 uses 8 bits and UTF-16 uses 16 bits for a code unit, which means UTF-8 needs a minimum of 8 bits, or 1 byte, to represent a character.
A code point is any numerical value that defines the character and this is represented by one or more code units depending on the encoding. As UTF-8 is compatible with ASCII, all ASCII characters are represented in a single byte (8 bits), hence UTF-8 needs only 1 code unit to represent them.
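To make the relation between code points and code units concrete, here is a small sketch that uses utf8.RuneLen from the standard unicode/utf8 package to report how many bytes (code units) UTF-8 needs for a few sample characters. The wrapper name codeUnits and the choice of characters are illustrative.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codeUnits reports how many UTF-8 code units (bytes) encode the code point r.
func codeUnits(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	for _, r := range []rune{'A', 'õ', '€', '😀'} {
		fmt.Printf("U+%04X needs %d code unit(s)\n", r, codeUnits(r))
	}
	// U+0041 needs 1 code unit(s)
	// U+00F5 needs 2 code unit(s)
	// U+20AC needs 3 code unit(s)
	// U+1F600 needs 4 code unit(s)
}
```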
But the biggest question is: if characters are represented by int32, then why are we getting the uint8 type in the above example? As said earlier, in Go, a string is a read-only slice of bytes. When we use the len function on a string, it calculates the length of that slice. When we use a for loop with an index, it walks over the slice returning one byte, or one code unit, at a time. So far, all our characters were in the ASCII character set, so each byte provided by the for loop was a valid character; in other words, a code unit was in fact a code point. Hence %c in the Printf statement could print a valid character from that byte value. But as we know, a UTF-8 code point, or character value, can be represented by a series of one or more bytes (at most 4). What will happen in the for loop we saw earlier if we introduce non-ASCII characters?
Let’s replace o in Hello with õ (LATIN SMALL LETTER O WITH TILDE, http://www.utf8-chartable.de), which has the Unicode code point U+00F5 and is represented in UTF-8 by 2 code units (2 bytes), c3 b5 (in hexadecimal). So instead of 6f for the character o, we should expect c3 b5 for the character õ.
https://play.golang.org/p/rhueGpn4pDc
In the above result, we got c3 b5 instead of 6f, but the characters of Hellõ World were not printed correctly. We also see that len(s) returns 12, because len counts the number of bytes in a string, not characters. The problem is that indexing a string (as our for loop does) accesses individual bytes, not characters. %c then treats each byte value as a code point of its own: c3 (decimal 195) is the code point of Ã and b5 (decimal 181) is the code point of µ.
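The behaviour described above can be reproduced with a sketch like this one. The helper hexBytes is an illustrative name; it relies on the "% x" verb of the fmt package, which prints the bytes of a string in hexadecimal separated by spaces.

```go
package main

import "fmt"

// hexBytes renders every byte of s as two hex digits separated by spaces.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	s := "Hellõ World"
	fmt.Println(len(s))      // 12, because õ takes two bytes
	fmt.Println(hexBytes(s)) // 48 65 6c 6c c3 b5 20 57 6f 72 6c 64

	for i := 0; i < len(s); i++ {
		fmt.Printf("%c", s[i]) // each byte is printed as if it were a code point
	}
	fmt.Println() // the loop above prints: HellÃµ World
}
```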
To avoid this chaos, Go provides the data type rune (a synonym for code point), which is an alias of int32. I told you earlier (but have not proved yet) that Go represents a character (code point) with the int32 data type.
An interesting question is why rune is int32 and not uint32, given that a character’s code point value cannot be negative while the int32 data type can hold both negative and positive values.
So, instead of a slice of bytes, we need to convert a string into a slice of runes.
https://play.golang.org/p/ELgL-upVnz_r
We converted the string into a slice of runes using a type conversion. Observe f5 in the above result instead of c3 b5: we are now iterating over values of the rune data type, and the code point of õ in the Unicode table is f5 (hence the code point notation U+00F5), or decimal 245. We also got the length 11 for string s, which is correct, because there are 11 runes in the slice (11 code points, or 11 characters). This also proves that a code point, or character, in Go is represented by the int32 data type.
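A minimal sketch of the conversion, assuming the same Hellõ World string, is shown below. The helper toRunes is an illustrative name, and utf8.RuneCountInString from the standard library is included as an alternative way to count characters without building a slice.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toRunes converts s into a slice of runes, one element per code point.
func toRunes(s string) []rune {
	return []rune(s)
}

func main() {
	s := "Hellõ World"
	runes := toRunes(s)

	fmt.Println(len(s))                    // 12 bytes
	fmt.Println(len(runes))                // 11 runes
	fmt.Println(utf8.RuneCountInString(s)) // 11, counted without a conversion

	for _, r := range runes {
		fmt.Printf("%x ", r) // 48 65 6c 6c f5 20 57 6f 72 6c 64
	}
	fmt.Println()
	fmt.Printf("%T\n", runes[4]) // int32
}
```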
If you use range in a for loop over a string, each iteration returns the byte index at which a character starts and the character itself as a rune.
https://play.golang.org/p/Xet2cJbywLH
In the above program, index 5 is missing because the byte at index 5 is the second code unit of the õ character. If you don’t need the index value, you can ignore it by using _ (the blank identifier) instead.
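The skipped index is easy to observe in a short sketch. The helper runeStarts is a hypothetical name; it records the byte index that range reports for each character.

```go
package main

import "fmt"

// runeStarts returns the byte index at which each character of s begins.
func runeStarts(s string) []int {
	starts := make([]int, 0, len(s))
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "Hellõ World"
	for i, r := range s {
		fmt.Printf("%d:%c ", i, r) // r is a rune, i is a byte index
	}
	fmt.Println()
	fmt.Println(runeStarts(s)) // [0 1 2 3 4 6 7 8 9 10 11]
}
```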
A string is a slice of bytes, simple as that. When we use a for loop with range, we get runes, because range decodes each UTF-8 character in the string into a value of the rune data type. In Go, a character written between single quotes is called a character literal (rune literal). Hence, any valid UTF-8 character within single quotes (') is a rune, and its type is int32.
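The types involved can be checked directly with the %T verb. In this sketch, typeOf is a small illustrative helper rather than anything from the standard library.

```go
package main

import "fmt"

// typeOf returns the Go type name of v as reported by the %T verb.
func typeOf(v interface{}) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	r := 'õ' // a character literal in single quotes is a rune
	fmt.Println(typeOf(r))      // int32
	fmt.Println(typeOf("õ"))    // string
	fmt.Println(typeOf("õ"[0])) // uint8
	fmt.Println(r)              // 245, the code point as a decimal number
}
```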
As we saw in the earlier definition, a string is a read-only slice of bytes. Hence, if we try to replace any byte in it, the compiler will report an error.
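The sketch below keeps the offending assignment commented out so that the program still compiles, and shows one common workaround: copy the string into a []byte, change the copy, and convert it back. The helper replaceFirstByte is a hypothetical name used only for this illustration.

```go
package main

import "fmt"

// replaceFirstByte returns a copy of s whose first byte is replaced by b.
// The original string is left untouched because []byte(s) makes a copy.
func replaceFirstByte(s string, b byte) string {
	if len(s) == 0 {
		return s
	}
	buf := []byte(s)
	buf[0] = b
	return string(buf)
}

func main() {
	s := "Hello World"
	// s[0] = 'J' // does not compile: strings are read-only

	t := replaceFirstByte(s, 'J')
	fmt.Println(s) // Hello World
	fmt.Println(t) // Jello World
}
```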
String literals using backtick
Instead of double quotes, we can also use the backtick (`) character to write a string in Go. In double quotes (") you need to escape newlines, tabs and other characters that do not need to be escaped in backticks. If you put a line break in a backtick string, it is kept as a newline (\n) character; see https://golang.org/ref/spec#String_literals
The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the backticks; in particular, backslashes have no special meaning and the string may contain newlines. Carriage return characters (\r) inside raw string literals are discarded from the raw string value. - GoLang documentation
Let’s see a small example.
https://play.golang.org/p/9Ir-0Lxx0u3
We can see that the original formatting of the string, with its newline, tab and double quotes, persisted in the output, while the escape sequence \n was not interpreted as a newline and the carriage return \r was discarded.
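The difference between the two kinds of literal can also be seen by comparing lengths. This is a sketch with made-up sample strings, not the program from the link above.

```go
package main

import "fmt"

// Both constants are written with the same characters between the quotes.
const (
	interpreted = "a\tb\nc" // escapes are processed: 5 bytes
	raw         = `a\tb\nc` // escapes are kept as typed: 7 bytes
)

// multiline spans two source lines, so it contains a real newline character.
const multiline = `first "line"
second line`

func main() {
	fmt.Println(len(interpreted), len(raw)) // 5 7
	fmt.Println(raw)                        // a\tb\nc
	fmt.Println(multiline)                  // printed on two lines, quotes intact
}
```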
A character written in single quotes in Go is a rune, and runes can be compared because they represent Unicode code points (int32 values). Hence a character with a higher code point value is greater than a character with a lower one.
Let’s see a very simple example.
https://play.golang.org/p/aw8Sv8Vto-c
Since we know that characters are nothing but int32 values internally, we can do all sorts of comparisons with them. For example, a for loop can iterate over the range between two character values.
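Both ideas, comparing characters and looping between two of them, fit in a short sketch. The helper alphabet is an illustrative name and the chosen range a to e is arbitrary.

```go
package main

import "fmt"

// alphabet returns every character from 'from' to 'to', inclusive,
// by incrementing the rune like any other integer.
func alphabet(from, to rune) string {
	var out []rune
	for r := from; r <= to; r++ {
		out = append(out, r)
	}
	return string(out)
}

func main() {
	fmt.Println('a' < 'b')          // true, because 97 < 98
	fmt.Println('a', 'b')           // 97 98
	fmt.Println(alphabet('a', 'e')) // abcde
}
```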
This was a basic introduction to strings in Go, but the strings package provides many utility functions for performing all sorts of operations on strings, such as join, replace and search. The strings package is part of Go’s standard library.
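As a brief taste of that package, the sketch below calls a handful of its functions (Join, Replace, Contains, Index and ToUpper). The wrapper joinWords and the sample words are chosen here only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWords glues words together with a single space.
func joinWords(words ...string) string {
	return strings.Join(words, " ")
}

func main() {
	s := joinWords("Hello", "World")
	fmt.Println(s)                                    // Hello World
	fmt.Println(strings.Replace(s, "World", "Go", 1)) // Hello Go
	fmt.Println(strings.Contains(s, "World"))         // true
	fmt.Println(strings.Index(s, "World"))            // 6
	fmt.Println(strings.ToUpper(s))                   // HELLO WORLD
}
```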