Demystifying Encoding in MariaDB/MySQL: Tips to Prevent Data Nightmares
This blog aims to clear the fog around the often avoided topic of encoding. Most of the time you hope for the best and get away with it, but when encoding problems hit, they hit hard and cost a lot of time to solve – at least if you don't know what you're dealing with. The good news is that you can prevent encoding problems, and I encourage you to do so.
This blog post is loosely based on, and is something of a condensed version of, http://mysql.rjweb.org/doc.php/charcoll.
I'll cover everything that is essential to know up front and then play through some example scenarios in detail. At the end there will be a short summary on best practices and problem solving techniques.
Encoding – What is that?
Computers communicate using bits. A bit is the smallest unit of information and is either 0 or 1. Encoding is the translation of bits – to be precise, sequences of bits – into characters and vice versa. For example, the bit sequence 01000001 translates to "A" in UTF-8 encoding. But be aware that the bit sequence for an "A" might look different in another encoding.
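You can make this tangible right away in any MariaDB/MySQL client with the HEX() function, which shows the bytes a string consists of – a minimal sketch, assuming the connection character set is utf8mb4:

SELECT HEX('A');                        -- 41: one byte, 0100 0001
SELECT HEX('é');                        -- C3A9: two bytes in UTF-8
SELECT HEX(CONVERT('é' USING latin1));  -- E9: same character, different bits in latin-1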
Useful Knowledge Nuggets
- bit vs. byte: We just heard about bits. A sequence of 8 bits (e.g. the UTF-8 "A" from above) is called a byte. One byte can represent 256 different states, because 2^8 = 256.
- character set vs. collation: When you create a database or a table in MariaDB/MySQL, you can set a character set and a collation. The character set basically tells the database which encoding to use when data (in the form of bits) is inserted. The collation, on the other hand, is a set of rules that determines how characters are compared. For example, the collation decides whether "a" and "A" are equal, or whether "oe" and "ö" are. Handy to know when we sort or filter data, do a GROUP BY and so on (see the example right after this list).
- Mojibake: This Japanese word describes the garbled characters that result from using the wrong encoding, producing nonsense or unreadable text. The German word for it translates to "letter salad", by the way.
- hexadecimal: When talking about encoding, we can't avoid hexadecimal notation. Hexadecimal is used as a kind of shorthand for bit/byte sequences. The table below shows how the hexadecimal system works. The hex for the UTF-8 "A" from the beginning would be 0100 0001 -> 41.
Hexadecimal   Bits   Decimal
0             0000   0
1             0001   1
2             0010   2
3             0011   3
4             0100   4
5             0101   5
6             0110   6
7             0111   7
8             1000   8
9             1001   9
A             1010   10
B             1011   11
C             1100   12
D             1101   13
E             1110   14
F             1111   15
- Unicode vs. UTF-8: The Unicode Standard provides a map of identifiers for more or less all natural-language characters there are. It assigns every character an identifier; for example, the identifier for "A" is U+0041. How these identifiers are saved in terms of bytes is the job of the encoding – UTF-8 is one of these encodings.
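To illustrate the character set vs. collation bullet above, here is a minimal sketch – again assuming the connection character set is utf8mb4:

SELECT 'a' = 'A' COLLATE utf8mb4_general_ci;  -- 1: this collation ignores case
SELECT 'a' = 'A' COLLATE utf8mb4_bin;         -- 0: the binary collation compares exact code points

The same bytes are stored either way; only the comparison rules differ.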
Where does encoding play a role anyways?
You won't like this. Encoding literally plays a role everywhere. The database has an encoding, and so does every table and every single column in it – and they could all be different. The server the database runs on, the client that connects to the database, the machine the client runs on, and the connection to the database itself all have encoding settings as well. Don't give up here; there are a few key points to watch out for. Note that in the following I will assume that all tables and columns in a database share the same encoding settings (character set and collation) – which I also strongly recommend! So let's get back to the encoding settings you should look out for: client/data, connection and database/table.
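You can inspect most of these settings in one go. The character_set_% variables cover client, connection, results, database and server:

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';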
client/data
If you type the data into your client and send it to the database, the data will use the client's encoding. For example, if PyCharm is set to UTF-8 and I type an "A" to send to the database, it will be handed to the connection as 01000001. However, if you use the client to fetch data from elsewhere – the web, a file or an app – it might arrive in a different encoding than your client's setting.
connection
When you send data to or query data from a database using a client, you first connect to the database. This connection always has an encoding set. It is essentially the declaration "the data I'm sending over this connection has this encoding". If you don't set it manually, a default is assumed – so it should always be set manually, unless you're looking for trouble. For this, connect to the database and run the query "SET NAMES utf8mb4;" to set the connection to UTF-8 (note that the character set is called utf8mb4 in MariaDB/MySQL, not "UTF-8" – more on that later). Do the insert or query statements afterwards. Why is this so important? I'll come back to it.
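As a sketch – the table artists and its column name are made up for illustration – the flow looks like this. SET NAMES sets character_set_client, character_set_connection and character_set_results in one go:

SET NAMES utf8mb4;                              -- "everything I send over this connection is utf8mb4"
INSERT INTO artists (name) VALUES ('Beyoncé');  -- now interpreted correctly by the server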
database/table
As you remember, each database, table and column can have its own encoding settings in terms of character set and collation. Make sure to set these intentionally and don't let the default settings mess things up. Again, more on that later.
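Setting it intentionally could look like this – the names music and artists are again made up:

CREATE DATABASE music CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE TABLE music.artists (
    name VARCHAR(100)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;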
How to determine a file's or a text's encoding
Unfortunately, you can't determine the encoding of a text with absolute certainty. Depending on what you're working with, there are libraries and commands for estimating the encoding of a file or text. These are educated guesses, but a good guess is better than nothing.
In Python there's the detect function from the chardet library, which will find the most probable encoding of your file. On Linux you can also make an educated guess on the command line without any extra package, for example by typing file -i <name>.txt
Encoding workflow – where the puzzle pieces come together
By now I hope I've covered all the prerequisites vital to understanding what's going on here. At this point I'm going to play through some encoding scenarios in detail. But first, let's go through a plain vanilla scenario.
Plain Vanilla – all UTF-8
First you create a database connection in your client, then you set the encoding within the connection to UTF-8 (remember the "SET NAMES" query). If you now type "Hello" and use an insert statement to send it to the database, the characters of this string are converted into bits – according to their encoding – and given to the connection. The bits for "Hello" are 01001000 01100101 01101100 01101100 01101111, which can be written more compactly as the hexadecimal 48 65 6C 6C 6F. Within the connection these bits are converted to characters according to "SET NAMES", in this case UTF-8. Of course the result here is "Hello". These characters are delivered to the database, which again translates them back into bits and saves them. When doing a select on this, we get back the word "Hello".
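You can verify the bytes directly (assuming a utf8mb4 connection):

SELECT 'Hello', HEX('Hello'), LENGTH('Hello'), CHAR_LENGTH('Hello');
-- Hello | 48656C6C6F | 5 | 5   (five 1-byte characters)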
Now, let the games begin. For this to be any fun, we need to switch to a word that actually has the potential to produce mojibake. I'd like to go with "Beyoncé", as in http://mysql.rjweb.org/doc.php/charcoll. The binary for "Beyoncé" is 01000010 01100101 01111001 01101111 01101110 01100011 11000011 10101001 and the hexadecimal is 42 65 79 6F 6E 63 C3 A9. The character with the potential to make trouble is the "é", which is a 2-byte character (C3 A9).
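Again easy to check for yourself (assuming a utf8mb4 connection):

SELECT HEX('Beyoncé'), LENGTH('Beyoncé'), CHAR_LENGTH('Beyoncé');
-- 4265796F6E63C3A9 | 8 | 7   (7 characters, but 8 bytes: 'é' takes two)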
Mojibake #1:
data: UTF-8, connection: latin-1, database: latin-1
In this case, let's assume we did not care about encoding at all and created a database/table with no encoding specification, so it was created with the default encoding latin-1. I also do not specify via "SET NAMES" which encoding my client uses, so latin-1 is assumed here as well. My client (i.e. PyCharm) is set to UTF-8 encoding. What happens?
As before, the characters "Beyoncé" are translated into bits and given to the connection. Within the connection the bits are now translated using the wrong encoding, namely latin-1. Latin-1 is a pure 1-byte encoding, so it reads C3 A9 as two characters, which in latin-1 are Ã and ©. So the characters "BeyoncÃ©" are sent to the database and appear as such when we select them.
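You can simulate this misreading directly in SQL: strip the charset label off the UTF-8 bytes with a binary cast and reinterpret them as latin-1 (a sketch, assuming a utf8mb4 connection):

SELECT CONVERT(CAST('Beyoncé' AS BINARY) USING latin1);
-- BeyoncÃ©   (C3 A9 read as the two latin-1 characters Ã and ©)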
Mojibake #2:
data: UTF-8, connection: latin-1, database: UTF-8
In this situation our client is still set to UTF-8 and our database/table is set to UTF-8 encoding as well. Only the connection is wrong: we either specify the wrong encoding, or the wrong encoding is assumed because we didn't specify one at all – again latin-1.
As before the characters "Beyoncé" are translated into bits and given to the connection.
Within the connection the bits are translated into latin-1 characters, resulting in "BeyoncÃ©" being delivered to the database. Now the database thinks UTF-8 and therefore translates these latin-1 characters into the corresponding UTF-8 bits: Ã becomes C3 83 and © becomes C2 A9.
With that the database saves 42 65 79 6F 6E 63 C3 83 C2 A9, two bytes more than our original.
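This, too, can be simulated in one query (same assumptions as before):

SELECT HEX(CONVERT(CONVERT(CAST('Beyoncé' AS BINARY) USING latin1) USING utf8mb4));
-- 4265796F6E63C383C2A9   (10 bytes: the double-encoded version)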
No Mojibake:
data: latin-1, connection: latin-1, database: UTF-8
I'd now like to play through an example where everything is done right. Imagine I got some data from somewhere on the web – which magically again consists of the word "Beyoncé" – and this time the encoding is (correctly) assumed to be latin-1. The latin-1 hex for "Beyoncé" is 42 65 79 6F 6E 63 E9, so the correct latin-1 translation for é is E9 (1110 1001). We tell the connection that the data we'll be sending is latin-1 via "SET NAMES latin1;". Over the wire, 42 65 79 6F 6E 63 E9 is correctly translated back to "Beyoncé" and given to the database. The database thinks UTF-8 and translates the characters into UTF-8 bytes for saving, which is 42 65 79 6F 6E 63 C3 A9. Selecting the inserted word, we receive "Beyoncé". Beautiful.
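Put together as a sketch (again with the made-up artists table):

SET NAMES latin1;                               -- our bytes are latin-1
INSERT INTO artists (name) VALUES ('Beyoncé');  -- arrives as 42 65 79 6F 6E 63 E9
SELECT name, HEX(name) FROM artists;            -- Beyoncé | 4265796F6E63C3A9 (stored as utf8mb4)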
Best Practice
- Make use of SET NAMES.
- Set your database / table encoding manually, and try not to mix.
- Use UTF-8 everywhere. In MariaDB and MySQL, using UTF-8 actually means utf8mb4, which I will explain in the next section. UTF-8 was designed to solve the encoding mess the internet got us into. If you'd like to learn a bit ;) more about UTF-8, here's a video I can recommend: Characters, Symbols and the Unicode Miracle by Computerphile
UTF-8 in MariaDB / MySQL - Attention!
Be aware that in MariaDB and MySQL the character set called "utf8" is not valid UTF-8: it is a 3-byte encoding, while real UTF-8 needs up to 4 bytes per character. You won't run into any trouble using "utf8" as long as you're sure you'll never need any 4-byte characters. If you try to insert 4-byte characters into a table that is "utf8" encoded, you'll receive an error. To have a table/database in MariaDB/MySQL that is truly UTF-8 encoded, use the utf8mb4 option.
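A quick sketch of the difference, using emoji as the typical 4-byte characters (the table names are made up; the exact failure mode depends on your sql_mode – strict mode rejects the insert):

CREATE TABLE t3 (s VARCHAR(10)) CHARACTER SET utf8;      -- MariaDB/MySQL's 3-byte "utf8"
CREATE TABLE t4 (s VARCHAR(10)) CHARACTER SET utf8mb4;
INSERT INTO t3 VALUES ('🙂');   -- fails (or mangles the value): 🙂 is a 4-byte character, F0 9F 99 82
INSERT INTO t4 VALUES ('🙂');   -- works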
Problem Solving Techniques
When you come across mojibake in your database, check whether the best practices have been used. Try to re-evaluate the encoding of your data: check the encoding settings of your client (are the settings in your editor correct?) or try another encoding if you feel the guess might not be right. What is also very helpful is to dig into the database and look at how the data you inserted was actually stored:
SELECT x, HEX(x), LENGTH(x), CHAR_LENGTH(x) FROM ... WHERE ... LIMIT 1
LENGTH(x) is the number of bytes x needs to be stored; CHAR_LENGTH(x) is the number of characters in x.
Example: in UTF-8 I would expect x = "Beyoncé" to have a LENGTH of 8 bytes (and a CHAR_LENGTH of 7). If I got 10 bytes instead of 8, this would point to mojibake case #2, meaning the encoding was not set correctly in the connection. This case is called double encoding.
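For completeness, the classic repair for double-encoded data looks like the sketch below: transcode the stored characters back to latin-1 bytes, then reinterpret those bytes as utf8mb4. Table and column names are made up, and you should absolutely test this on a copy of the table first:

UPDATE artists
SET name = CONVERT(CAST(CONVERT(name USING latin1) AS BINARY) USING utf8mb4);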
Also, taking a peek at which character set and collation are actually specified for a table and its columns can be helpful. For that, use the SHOW CREATE TABLE statement.
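For example (artists again being our made-up table):

SHOW CREATE TABLE artists;
-- look for DEFAULT CHARSET=... at the table level and for explicit
-- CHARACTER SET / COLLATE clauses on individual columns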
In any case, the SET NAMES has to agree with the actual encoding of the data.
Conclusion
Whether you're managing a small personal project or a large company database, a solid grasp of how encoding works will help you prevent encoding issues and tackle them more efficiently if they occur. I hope this article helped you gain a more profound understanding of encoding in MariaDB and MySQL. See https://mysql.rjweb.org/doc.php/charcoll by Rick James for more details and a deep dive into the topic.