Python - Extract Text From Pdf Page-wise To List

I am trying to extract text page by page from a PDF and store each page's words as a list inside an outer list, like [['This', 'is', 'one', 'page'], ['I', 'am', 'page', 'TWO'], ['Three', 'that\'s',

Solution 1:

Well, you could try this:

import PyPDF2

pdf_file = "path/to/your/file.pdf"  # replace with the path to your PDF
reader = PyPDF2.PdfReader(pdf_file)  # older PyPDF2 releases name this PdfFileReader

pages = []
for page in reader.pages:
    # Extract this page's text, then split it into words on any whitespace
    pages.append(page.extract_text().split())
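
How the page text is split matters, because extracted text usually contains newlines and runs of spaces. Here is a minimal sketch of the splitting step on its own, using made-up page strings rather than a real PDF. The helper name pages_to_word_lists and the sample strings are assumptions for illustration and are not part of PyPDF2.

```python
def pages_to_word_lists(page_texts):
    """Turn a list of page strings into a list of word lists, one per page."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines) and never produces empty strings.
    return [text.split() for text in page_texts]


# Hypothetical page texts standing in for text extracted from a PDF.
sample_pages = ["This is\none  page", "I am page TWO"]

# Splitting on a single space keeps the newline and yields an empty string:
print(sample_pages[0].split(" "))         # ['This', 'is\none', '', 'page']

# Splitting on any whitespace gives clean word lists per page:
print(pages_to_word_lists(sample_pages))  # [['This', 'is', 'one', 'page'], ['I', 'am', 'page', 'TWO']]
```

If you need to keep line structure, you could split each page on newlines first and then split every line into words.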
