
Python Unreproducible UnicodeDecodeError

I'm trying to replace a substring in a Word file, using the following command sequence in Python. The code on its own works perfectly fine, even with the exact same Word file, but when it runs as part of a larger framework it fails with a UnicodeDecodeError.

Solution 1:

The problem is mixing Unicode and byte strings. Python 2 "helpfully" tries to convert from one to the other but defaults to using the ascii codec.

Here's an example:

>>> 'aeioü'.replace('a','b')  # all byte strings
'beio\xfc'
>>> 'aeioü'.replace(u'a','b') # one Unicode string and it converts...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 4: ordinal not in range(128)
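
For comparison, keeping both operands Unicode avoids the implicit ascii decode entirely (continuing the interactive session above; the u'\xfc' repr assumes the same Latin-1-style terminal encoding as the original example):

>>> u'aeioü'.replace(u'a', u'b') # all Unicode strings, no conversion needed
u'beio\xfc'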

You mentioned reading a UUID from JSON; the json module returns Unicode strings. Ideally, decode text files to Unicode when reading them, do all text processing in Unicode, and encode back to bytes when writing to storage. In your "larger framework" this could be a big porting job, but essentially use io.open with an encoding to read a file and decode it to Unicode:

with io.open(fname, 'r', encoding='utf8') as fd:
    contents = fd.read().replace(search, replace)

Note that encoding should match the actual encoding of the files you are reading. That's something you'll have to determine.

A shortcut, as you've found in your edit, is to encode the UUID from JSON back to a byte string, but using Unicode to deal with text should be the goal.
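
A minimal sketch of that decode-process-encode round trip, assuming UTF-8 files and hypothetical search/replace values (in your real code they come from the JSON):

import io

search = u'old-uuid'    # assumed placeholder values
replace = u'new-uuid'

with io.open(fname, 'r', encoding='utf8') as fd:
    contents = fd.read().replace(search, replace)  # everything here is Unicode

with io.open(fname, 'w', encoding='utf8') as fd:
    fd.write(contents)                             # encoded back to UTF-8 on write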

Python 3 cleans up this process by making strings Unicode by default and dropping the implicit conversion between byte strings and Unicode strings.
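
In Python 3 the same kind of mix fails immediately with a TypeError instead of silently attempting an ascii decode (the exact message may vary slightly between versions):

>>> 'aeioü'.replace(b'a', b'b')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: replace() argument 1 must be str, not bytes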

Solution 2:

Change this line:

with open(fname, 'r') as fd:

to this:

with open(fname, 'r', encoding='latin1') as fd:

The ascii codec can only handle character codes between 0 and 127 inclusive. Your file contains the character code 0xc3, which is outside that range, so you need to choose a different codec. (Note that on Python 2 the built-in open has no encoding parameter; use io.open instead.)
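
You can see the difference between the two codecs directly: latin1 maps every byte value 0-255 to a code point, so it never raises, but the result is only meaningful if the file really is Latin-1 encoded (Python 2 session shown):

>>> '\xc3'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> '\xc3'.decode('latin1')   # every byte is valid latin1
u'\xc3'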

Solution 3:

Every time I've had a problem with special characters in the past, I've resolved it by decoding to Unicode when reading and then encoding to UTF-8 when writing back to a file.

I hope this works for you too.

For my solution I've always used what I found in this presentation: http://farmdev.com/talks/unicode/

So I would use this:

def to_unicode_or_bust(obj, encoding='utf-8'):
    # Decode byte strings to Unicode; Unicode objects pass through untouched.
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj

Then in your code:

contents = to_unicode_or_bust(fd.read().replace(search, replace))

And then, when writing, encode back to UTF-8:

output_zip.writestr(entry, contents.encode('utf-8'))
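
For context, here is a hypothetical sketch of the surrounding loop. A Word .docx file is a zip archive, so the idea is to decode only the XML entries, do the replacement in Unicode, and re-encode on write; the file names and the search/replace values are assumptions, not your actual code:

import zipfile

search, replace = u'old-uuid', u'new-uuid'      # assumed placeholder values

with zipfile.ZipFile('input.docx') as input_zip, \
     zipfile.ZipFile('output.docx', 'w') as output_zip:
    for entry in input_zip.namelist():
        data = input_zip.read(entry)
        if entry.endswith('.xml'):              # text parts: decode, replace, re-encode
            contents = to_unicode_or_bust(data).replace(search, replace)
            output_zip.writestr(entry, contents.encode('utf-8'))
        else:                                   # binary parts are copied through unchanged
            output_zip.writestr(entry, data)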

I didn't reproduce your issue, so this is just a suggestion. I hope it works.
