Character Confusion

13 Dec 2013

Living in Norway and having a last name with plenty of Norwegian characters, I am quite used to seeing address labels and web pages showing my name with strange characters.

This is what happens when computer systems exchange text data but use different character encodings. This form of character confusion is so common that it even has a name: Mojibake.

The letter I received this time was addressed to my name, which is Kjetil Østerås; however, the label read Kjetil Řsterĺs. There is clearly a computer involved when you see errors like this; no human would make this mistake.

Analysis

Being curious, I would like to know what type of character translation was done inside their computer system in order to produce this output. So let's get out the reverse engineering gloves. This should be quite easy since we know both the input and the output.

The first thing we know is that each wrong character replaces exactly one original character, so there is a 1-to-1 byte conversion error here. That means a UTF-8 character conversion is out of the picture: the Norwegian letters are all encoded as 2 bytes in UTF-8, while here we are dealing with a computer system that uses a single byte for each character.
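As a quick check of the byte counts, here is a minimal sketch that compares how many bytes each Norwegian letter takes in UTF-8 and in a single-byte encoding. The choice of latin-1 as the single-byte encoding and the variable name byte_lengths are assumptions made for illustration.

```python
# Compare the encoded size of the Norwegian letters in UTF-8 and latin-1.
letters = "ÆØÅæøå"

# Map each letter to (bytes in UTF-8, bytes in latin-1).
byte_lengths = {
    ch: (len(ch.encode("utf-8")), len(ch.encode("latin-1")))
    for ch in letters
}

for ch, (utf8_len, latin1_len) in byte_lengths.items():
    print(ch, "utf-8:", utf8_len, "latin-1:", latin1_len)

# The garbled label has the same number of characters as the real name,
# which is what we expect when every character is a single byte.
same_length = len("Kjetil Østerås") == len("Kjetil Řsterĺs")
print("same length:", same_length)
```

If UTF-8 bytes had been misread by a single-byte system, each Norwegian letter would have turned into two strange characters instead of one.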

Now we can look at the character Ř. This is a Czech character with the Unicode code point 344 (U+0158). We can print this character in Python 3 using this command:

>>> chr(344)
'Ř'

The system we are looking at is not using Unicode, so we can try to encode this character using a different character set, for instance the Windows-1250 character encoding.

>>> chr(344).encode('Windows-1250')
b'\xd8'

Now we know that Ř is encoded as 0xd8 in the Windows-1250 character set, so we can try other character sets to decode the byte 0xd8. Let's try the 'latin-1' character set, since it is commonly used in Scandinavia.

>>> b'\xd8'.decode('latin-1')
'Ø'

Nice, so we know that the computer systems that processed my name use latin-1 as the input character set and Windows-1250 as the output character set.

We can model the character flow in their computer system using this Python 3 expression.

>>> 'Kjetil Østerås'.encode('latin-1').decode('Windows-1250')
'Kjetil Řsterĺs'
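The manual steps above can also be automated. The following is only a sketch: the function name find_encoding_pairs and the list of candidate encodings are my own assumptions, and the real system may have used something that is not on the list.

```python
def find_encoding_pairs(original, garbled, candidates):
    """Return (source, target) pairs where encoding `original` with
    `source` and decoding the bytes with `target` yields `garbled`."""
    pairs = []
    for source in candidates:
        try:
            raw = original.encode(source)
        except UnicodeEncodeError:
            # This encoding cannot represent the original text at all.
            continue
        for target in candidates:
            try:
                decoded = raw.decode(target)
            except UnicodeDecodeError:
                # The bytes are not valid in this encoding.
                continue
            if decoded == garbled:
                pairs.append((source, target))
    return pairs


# Hypothetical candidate list; extend it with any codec Python knows.
candidates = ["utf-8", "latin-1", "cp1250", "cp1252", "iso8859-2", "cp850"]

pairs = find_encoding_pairs("Kjetil Østerås", "Kjetil Řsterĺs", candidates)
for source, target in pairs:
    print(source, "->", target)
```

Note that more than one pair can match, because related encodings agree on many byte values. The search narrows down the possibilities but does not prove which pair the real system used.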

Pretty neat. This character confusion is bad, but it's nothing compared to the more common error when rendering my name. Can you guess what encodings are used as input and output in this example: Kjetil Østerås?

Conclusion

When creating a computer system that has anything to do with text, please be conscious about character encoding, and use Unicode with the UTF-8 encoding.
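In Python 3, one way to follow this advice is to always state the encoding explicitly when text crosses a file boundary. This is a minimal sketch; the file name label.txt and the variable names are hypothetical, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

name = "Kjetil Østerås"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "label.txt")

    # Write the text with an explicit encoding instead of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(name)

    # Reading with the same encoding gives the original text back.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # Reading with a different encoding silently produces mojibake.
    with open(path, "r", encoding="latin-1") as f:
        mismatched = f.read()

print(round_trip == name)
print(mismatched == name)
```

The mismatched read does not raise an error, which is exactly why this kind of bug tends to go unnoticed until it ends up on an address label.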