Python String Surprise

28 Apr 2018

Working with strings in Python 2 can be surprising. In this post I describe one surprise that I ran into in a Python 2 script inside an internal test system.

This test system reads the standard output of a process, parses it, and produces a JUnit XML file with test reports. It had been pretty stable, and then it suddenly started to fail. The Python failure looked something like this.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

This seemed strange because the system normally only handled US-ASCII characters. However, an error in the test reporting had put an è character into the input to the Python script, and this character was making the script misbehave. This had to be investigated some more.

From the error message we can see that the script is trying to decode a byte with the value 0xc3 using the ASCII codec. è is encoded as 0xc3 0xa8 in UTF-8, so somewhere in the script a UTF-8 encoded string is being decoded as ASCII for some reason.
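As a quick check of the byte values mentioned above, here is a minimal sketch that encodes è as UTF-8 and lists the resulting bytes. The variable names are only for illustration and are not taken from the original script. It is written so that it should behave the same under Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative names only; none of these come from the original script.
e_grave = u'\u00e8'  # LATIN SMALL LETTER E WITH GRAVE (è)

# Encode the character to its UTF-8 byte sequence.
encoded = e_grave.encode('utf-8')

# bytearray yields integers on both Python 2 and Python 3.
byte_values = [hex(b) for b in bytearray(encoded)]
print(byte_values)  # ['0xc3', '0xa8']

# ASCII only covers 0..127, so both bytes are outside its range.
outside_ascii = [b >= 128 for b in bytearray(encoded)]
print(outside_ascii)  # [True, True]
```

Both bytes are above 127, which is why the ASCII codec gives up on the very first one, at position 0.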

So I started to write small Python snippets to see under what circumstances 0xc3 0xa8 would trigger an error. My first attempt was to try common string operations like concatenation and printing.

>>> s = '\xc3\xa8'
>>> s = 'hello' + s
>>> print(s)
helloè

But none of these operations triggers any conversion, so this was not the issue. So I thought that one of the file operations might trigger the error, and I tried to reproduce it by writing to and reading from a file.

>>> f = open('tmp.txt', 'w')
>>> f.write(s)
>>> f.close()
>>> f = open('tmp.txt', 'r')
>>> f.read()
'hello\xc3\xa8'

Still no error. So I started to read up on Python's Unicode handling at https://docs.python.org/2/howto/unicode.html. That page shows the same error that I was seeing in the script, so I tried the code from the Unicode documentation in my test case.

>>> unicode('\xc3\xa8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Great, now we see the error message that we are looking for. However, in the script I was investigating there was no visible conversion to unicode: there was no call to the unicode function anywhere. There was, however, code that concatenated a string, so I started to print out the type of the string while it was being concatenated with different variables. This was when I noticed that the string started as <type 'str'> and then at one point it was converted to <type 'unicode'>. It turned out that one dictionary key in our script was of unicode string type, and when a unicode string is concatenated with an ordinary string, the ordinary string is automatically decoded to unicode using the default ASCII codec. This was exactly what we were seeing in the script. So here is a small Python snippet that shows the failure.

>>> e = '\xc3\xa8'
>>> s = u'Hello'
>>> type(e)
<type 'str'>
>>> type(s)
<type 'unicode'>
>>> s += e
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
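The type-printing check described above can be sketched roughly as follows. The helper name type_report and the sample fragments are hypothetical and are not taken from the original script; the idea is only to list the type of every piece before it is concatenated, so that a mix of string types stands out.

```python
def type_report(parts):
    """Return the type name of every fragment, in order.

    Hypothetical debugging helper, not part of the original script.
    """
    return [type(part).__name__ for part in parts]


# Made-up fragments: two byte strings and one unicode string,
# similar to the unicode dictionary key described above.
fragments = [b'Test: ', u'name', b'\xc3\xa8']
print(type_report(fragments))
```

Under Python 2 the byte strings are reported as str and the middle fragment as unicode. Under Python 3 the names are bytes and str instead, and concatenating the two raises a TypeError rather than decoding implicitly.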

Now that we have found the reason for the error, we need to find out how to solve it. One way is to decode the UTF-8 input to unicode before doing any string concatenation.

>>> e = '\xc3\xa8'
>>> msg = e.decode('utf-8', 'ignore')
>>> type(e)
<type 'str'>
>>> type(msg)
<type 'unicode'>
>>> msg += u"Hello"
>>> output = msg.encode('utf-8', 'ignore')
>>> print(output)
èHello

In this way we take control of both the decoding from a byte string into unicode for processing inside the script and the encoding into UTF-8 when generating the output. So to reduce the number of surprises in your Python 2 scripts, you should always be in full control of the string encoding of both input and output, and be aware that Python 2 will implicitly decode a str with the ASCII codec when it is concatenated with a unicode string.
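One possible way to apply this rule is to normalise every fragment to text at the point where it enters the script, and to encode only once when producing the output. The sketch below assumes the input is UTF-8. The names to_text and build_message are hypothetical helpers, not functions from the original script.

```python
def to_text(value, encoding='utf-8'):
    """Decode byte strings to text; pass text through unchanged.

    Hypothetical helper. The UTF-8 default is an assumption about the input.
    """
    if isinstance(value, bytes):  # bytes is an alias for str on Python 2
        return value.decode(encoding)
    return value


def build_message(parts):
    """Join fragments after converting each one to text explicitly."""
    return u''.join(to_text(part) for part in parts)


# Mixed input: a unicode fragment and a UTF-8 encoded byte string.
message = build_message([u'Hello ', b'\xc3\xa8'])

# Encode once, at the output boundary.
output = message.encode('utf-8')
```

Unlike the 'ignore' error handler used above, this sketch decodes strictly, so invalid input raises an error at the boundary instead of being dropped silently. Which behaviour is preferable depends on how much the test reports can tolerate missing characters.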