
Python Kludge To Read Ucs-2 (utf-16?) As Ascii

I'm in a little over my head on this one, so please pardon my terminology in advance. I'm running this using Python 2.7 on Windows XP. I found some Python code that reads a log fil

Solution 1:

codecs.open() lets you open a file using a specific encoding, and reading from it produces unicode objects instead of byte strings. You can try a few encodings, going from most likely to least likely (or the tool could just always produce UTF-16LE, but ha ha, fat chance).

Also, see "Unicode In Python, Completely Demystified".

Solution 2:

works.log appears to be encoded in ASCII:

>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True

breaks.log appears to be encoded in UTF-16LE -- it starts with the 2 bytes '\xff\xfe'. None of the characters in breaks.log are outside the ASCII range:

>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True

If these are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from:

f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart

to this:

f = open(sys.argv[1], 'rb')
data = f.read()
f.close()
if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')
# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines)))
print mb_toc_urlpart
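
If the hack is needed in more than one place, it can be wrapped in a small generator. This is only a sketch: the name iter_ascii_lines is hypothetical, and it makes the same assumption as the code above, namely that the input is either plain ASCII or UTF-16LE with a byte order mark and contains no characters outside the ASCII range. If that assumption is wrong, the encode step raises UnicodeEncodeError rather than silently mangling the data.

```python
import codecs

def iter_ascii_lines(data):
    """Yield newline-terminated byte strings from raw log bytes.

    Hypothetical helper; assumes data is ASCII or UTF-16LE with a BOM.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # the two bytes '\xff\xfe'
        # The 'utf-16' codec consumes the BOM; encode back to ASCII bytes.
        data = data.decode("utf-16").encode("ascii")
    for line in data.splitlines():
        # splitlines() drops '\n', '\r\n' and '\r'; re-add a plain newline.
        yield line + b"\n"
```

In the mainline code this would replace the ilines generator expression, for example ilines = iter_ascii_lines(data).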

Solution 3:

In Python 2.x, a normal str is a byte string, and implicit conversions between str and unicode assume ASCII. Try this:

Put this at the top of your Python source file:

from __future__ import unicode_literals

And change all the str(...) calls to unicode(...).

[edit]

And as Ignacio Vazquez-Abrams wrote, try codecs.open() to open the input file.
