
Python: Unwanted Unicode Characters When Printing/Extracting Text From a PDF

I am using Python 3.5.2 / Anaconda 4.1.1 to extract text from a PDF (http://www.mitpressjournals.org/doi/pdf/10.1162/INOV_a_00153) using PyPDF2. I am getting many of these unicode

Solution 1:

You could encode the text as ASCII and ignore the non-ASCII characters, which drops them from the output.

Try changing:

text=pageObj.extractText().encode('utf-8')

To:

text=pageObj.extractText().encode('ascii', 'ignore')

I've skimmed the output and it seems to have done the trick.
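
To see what the 'ignore' error handler does without needing the PDF, here is a minimal sketch on a made-up string. The helper name strip_non_ascii and the sample text are my own, not from your script. Note that in Python 3 encode() returns bytes, so the sketch decodes back to str before printing; that extra step is an assumption about what you want to do with the result.

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII (hypothetical helper)."""
    # encode() returns bytes; decode() turns them back into a str for printing.
    return text.encode('ascii', 'ignore').decode('ascii')


# Made-up sample containing an en dash, an accented letter and curly quotes.
sample = 'Innovations \u2013 na\u00efve \u201cquotes\u201d'
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Keep in mind that the characters are removed, not replaced, so accented letters and typographic punctuation simply vanish from the text.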

On a separate point, the range in your for loop starts at 1, so it skips index 0 and you miss some of the output (unless that's what was intended).

Change for a in range(1,num): to for a in range(0,num):
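
The difference between the two ranges is easy to check on a plain list. This is only a sketch: the pages list is a hypothetical stand-in for the pages of your PDF, and num is assumed to be the page count as in your loop.

```python
# Hypothetical stand-in for the text of each PDF page; indices start at 0.
pages = ['page 0 text', 'page 1 text', 'page 2 text']
num = len(pages)

# range(1, num) yields 1, 2: the item at index 0 is never visited.
skipping_first = [pages[a] for a in range(1, num)]

# range(0, num) yields 0, 1, 2 and is the same as range(num).
all_pages = [pages[a] for a in range(0, num)]

print(skipping_first)
print(all_pages)
```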
