I've written a Python script that uses PyPDF2, PIL and pytesseract to extract the text from the first page of a scanned PDF file. However, when I run the script below, it throws the following error when it reaches the line containing img = Image.open(pdfReader.getPage(0)).convert('L').
Script I have tried so far:
import PyPDF2
import pytesseract
from PIL import Image
pdfFileObj = open(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
img = Image.open(pdfReader.getPage(0)).convert('L')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
pdfFileObj.close()
The error I get:
Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\SO.py", line 8, in <module>
    img = Image.open(pdfReader.getPage(0)).convert('L')
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\site-packages\PIL\Image.py", line 2554, in open
    fp = io.BytesIO(fp.read())
AttributeError: 'PageObject' object has no attribute 'read'
How can I make this work?
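For context, the traceback suggests that Image.open wants a filename or a file-like object with a read() method, while getPage(0) returns a PyPDF2 PageObject, which is neither. Below is a rough sketch of one possible direction, not a confirmed fix: it assumes the third-party pdf2image package (which needs poppler installed) is used to render the page to a PIL image first. The helper names acceptable_for_image_open and ocr_first_page are hypothetical and only for illustration.

```python
import io
import os


def acceptable_for_image_open(obj):
    """Hypothetical helper: rough check of what Image.open can take,
    i.e. a path (str or os.PathLike) or an object with a read() method."""
    return isinstance(obj, (str, os.PathLike)) or hasattr(obj, "read")


def ocr_first_page(pdf_path, dpi=300):
    """Hypothetical helper: render page 1 of a scanned PDF and OCR it.

    Assumes pdf2image (plus poppler) and pytesseract (plus the tesseract
    binary) are installed; they are imported here so the rest of the
    module loads without them.
    """
    from pdf2image import convert_from_path
    import pytesseract

    # Render only the first page; convert_from_path returns a list of PIL images.
    pages = convert_from_path(pdf_path, dpi=dpi, first_page=1, last_page=1)
    img = pages[0].convert('L')  # greyscale, as in the original script
    return pytesseract.image_to_string(img)


# A BytesIO buffer is file-like, a bare object (like a PageObject) is not.
print(acceptable_for_image_open(io.BytesIO(b"")))
print(acceptable_for_image_open(object()))
```

Under those assumptions, usage would be something like print(ocr_first_page(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf')). The dpi value of 300 is a guess at a reasonable OCR resolution, not something taken from the script above.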