Python: Unwanted Unicode Characters When Printing/Extracting Text From a PDF
I am using Python 3.5.2/Anaconda 4.1.1 to extract text from a PDF (http://www.mitpressjournals.org/doi/pdf/10.1162/INOV_a_00153) using PyPDF2. I am getting many of these unicode
Solution 1:
You could encode the text in ASCII and ignore non-ASCII characters.
Try changing:
text=pageObj.extractText().encode('utf-8')
To:
text=pageObj.extractText().encode('ascii', 'ignore')
I've skimmed the output and it seems to have done the trick.
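As a minimal sketch of what that change does: in Python 3, encode('ascii', 'ignore') returns a bytes object, so printing it shows a b'...' prefix unless you decode it back to str. The sample string and the helper name strip_non_ascii below are hypothetical and stand in for the result of pageObj.extractText(); they are not part of the original code.

```python
# Hypothetical sample standing in for the output of pageObj.extractText().
raw_text = "Innovations \u2013 Technology \u2022 Governance"

def strip_non_ascii(text):
    """Drop every character that has no ASCII representation.

    encode(..., 'ignore') silently discards the offending characters and
    returns bytes; decode('ascii') turns the result back into a str.
    """
    return text.encode("ascii", "ignore").decode("ascii")

clean_text = strip_non_ascii(raw_text)
print(clean_text)
```

Note that this discards the characters rather than translating them, so dashes, bullets, and accented letters simply disappear from the output.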
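On a separate point, the range in your for loop is causing you to miss some of the output (unless that's what was intended). Assuming a is used as the page index, PyPDF2 numbers pages from 0, so starting at 1 skips the first page.
Change:
for a in range(1,num):
To:
for a in range(0,num):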
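The effect of the two ranges can be illustrated without a PDF at all. In this sketch a plain list stands in for the pages of the document, and num plays the role of the page count; both are assumptions made for illustration.

```python
# Hypothetical stand-in for the pages of a PDF; index 0 is the first page.
pages = ["first page", "second page", "third page"]
num = len(pages)  # plays the role of the page count

# range(1, num) starts at index 1, so the first page is never visited.
skipped_first = [pages[a] for a in range(1, num)]

# range(0, num), or simply range(num), visits every page.
all_pages = [pages[a] for a in range(0, num)]

print(skipped_first)
print(all_pages)
```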