Install Pypdf2 Python

PDF manipulation using PyPDF2

PyPDF2 is a Python library for PDF manipulation. It provides functions for splitting and merging PDFs, extracting text, and more.

PyPDF2 is a pure Python package, so you can install it using pip (assuming pip is on your system's path): python -m pip install pypdf2. As usual, you should install third-party Python packages into a Python virtual environment to keep them isolated from the rest of your system.

pyPdf was originally written for Python 2, but a Python 3 compatible branch has since been made available. The updated files enable pyPdf to be used with Python 3. To replace the old Python 2 files with the new Python 3 files, locate the following directory on your system: C:\Python32\Lib\site-packages\pyPdf.

Why?

Before going ahead, we need to understand why PDF manipulation is required.

Sometimes we need to extract text from a PDF for text processing such as NLP, find the number of pages in a given PDF, add a new page to a PDF, and so on.

There are many operations we may need to perform on PDFs to get the desired result, which is why it helps to know how to work with them.

In this article, I'll focus on text PDFs only, because extracting text from image PDFs (PDFs created from images of text) is not straightforward: you need Optical Character Recognition (OCR) to extract text from them.

If you are working on image PDFs or interested in Optical Character Recognition (OCR), then go through the following articles.

PyPDF2:

Installation

It's a Python library that can be installed using pip: python -m pip install pypdf2

Note: I am assuming that you are currently using Python 3.

Reading PDF

Import PyPDF2 and open the PDF file in binary read (rb) mode.

Now that we have the file object, we create a PdfFileReader to read it.

Next, get the number of pages in the PDF.

In PyPDF2, page numbering starts from 0, so the first page is fetched as page 0.

Now that we have the page_0 object, we can extract text from the first page.
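The reading steps above can be sketched roughly as follows. This is a minimal sketch, assuming the PyPDF2 1.x names used in this article (PdfFileReader, getNumPages, getPage, extractText); PyPDF2 3.x removed these names in favor of PdfReader, reader.pages and extract_text. The helper names summarize and read_first_page are hypothetical, chosen only for illustration.

```python
def summarize(reader):
    """Return (number of pages, text of page 0) for an open PdfFileReader."""
    num_pages = reader.getNumPages()
    page_0 = reader.getPage(0)  # page numbering starts at 0
    return num_pages, page_0.extractText()


def read_first_page(path):
    """Open `path` in binary read mode and summarize it."""
    # Imported here so summarize() can be tried without PyPDF2 installed.
    import PyPDF2  # third-party; assumes PyPDF2 1.x naming

    # The file must stay open while the reader is in use,
    # so the summary is computed inside the with block.
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return summarize(reader)
```

A call such as read_first_page("sample.pdf"), where sample.pdf is a placeholder file name, would return the page count together with whatever text PyPDF2 manages to extract from the first page.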

For more reading functions, check out the PdfFileReader documentation.

Writing PDF

Now we will write something into PDFs.

Open the output PDF in binary write (wb) mode. If the file doesn't exist, a new file will be created.

Now we will write the page that we fetched in the last section.

Suppose we want to write all the pages from one PDF to another. We don't need to fetch the pages one by one; we can add all of them at once.

Finally, close the files.
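The writing steps can be sketched as below. It is only an illustration, assuming the PyPDF2 1.x PdfFileWriter methods addPage, appendPagesFromReader and write (renamed in PyPDF2 3.x); copy_pages and copy_pdf are hypothetical helper names.

```python
def copy_pages(reader, writer, first_page_only=False):
    """Add page 0, or every page, of `reader` to `writer`."""
    if first_page_only:
        writer.addPage(reader.getPage(0))
    else:
        writer.appendPagesFromReader(reader)  # all pages in one call
    return writer


def copy_pdf(src_path, dst_path, first_page_only=False):
    """Copy pages from one PDF file into a new PDF file."""
    # Imported here so copy_pages() can be tried without PyPDF2 installed.
    import PyPDF2  # third-party; assumes PyPDF2 1.x naming

    # "wb" creates dst_path if it does not exist; both files are
    # closed automatically when the with block ends.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        reader = PyPDF2.PdfFileReader(src)
        writer = copy_pages(reader, PyPDF2.PdfFileWriter(), first_page_only)
        writer.write(dst)  # the source must still be open at this point
```

Using a with block is one way to follow the advice about closing files: both handles are released even if an error occurs part way through.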

Merging PDFs

PyPDF2 also provides functionality for merging (concatenating) two PDFs and for slicing a PDF.

Creating the PdfFileMerger object

Appending 2 PDFs

Saving the final output
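The three merging steps might look like the sketch below. It assumes the PyPDF2 1.x PdfFileMerger class with its append, write and close methods (PyPDF2 3.x calls the class PdfMerger); merge_into and merge_pdfs are hypothetical helper names.

```python
def merge_into(merger, input_paths, output_path):
    """Append each input PDF to `merger`, then save the combined result."""
    try:
        for path in input_paths:
            merger.append(path)  # adds all pages of this file at the end
        merger.write(output_path)
    finally:
        merger.close()  # release the files the merger opened


def merge_pdfs(input_paths, output_path):
    """Concatenate the PDFs in `input_paths` into `output_path`."""
    # Imported here so merge_into() can be tried without PyPDF2 installed.
    import PyPDF2  # third-party; assumes PyPDF2 1.x naming

    merge_into(PyPDF2.PdfFileMerger(), input_paths, output_path)
```

For example, merge_pdfs(["first.pdf", "second.pdf"], "merged.pdf") would append the two inputs in order; the file names here are placeholders.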

For more information, check out the PdfFileMerger documentation.

Note: Always close a file after performing an operation on it; otherwise an error might occur the next time you try to open it.

Thanks for reading.

If you find any mistake or issue, kindly let me know in the comments.

Motivation

Since I want to work with PDF files in Python at work, I investigated which library can do that and how to use it.

Preparation

The runtime and module versions are as follows.

  • Python 3.6
  • PyPDF2 1.26.0

Install PyPDF2

To work with PDF files in Python, PyPDF2 is often used.

PyPDF2 can

  • Extract text from a PDF file
  • Modify an existing PDF file and create a new one

Let's install it with the pip command.

Prepare PDF file

Prepare a PDF file to work with. This time I downloaded an Executive Order. It has three pages in all.

Read PDF file

In this section, we open and read a normal PDF file. The sample code prints the number of pages in the PDF file.

Open the PDF file in binary read mode after importing PyPDF2, and then create a PdfFileReader object to work with the PDF.

Check the result.

Read a PDF file with a password (encrypted PDF)

In this section, we open and read an encrypted PDF file that requires a password to open. To create an encrypted PDF file, set a password with the encryption option enabled when saving the PDF file.

Failed example

Save a PDF file named executive_order_encrypted.pdf with the password hoge1234. Open the PDF file and run the previous code, which reads the PDF without a password.

The following error message will be printed.

Success example

The decrypt function, given the password string as its argument, decrypts an encrypted PDF file. It is better to check whether the file is encrypted with the isEncrypted property before calling decrypt.
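The check-then-decrypt pattern described above could be sketched like this. It assumes the PyPDF2 1.x API, where isEncrypted is a property and decrypt returns 0 when the password does not match; unlock and count_pages are hypothetical helper names.

```python
def unlock(reader, password):
    """Decrypt `reader` if necessary; return True when it can be read."""
    if not reader.isEncrypted:
        return True  # nothing to decrypt
    # decrypt() returns 0 if the password matches neither the user
    # password nor the owner password.
    return reader.decrypt(password) != 0


def count_pages(path, password):
    """Return the page count of a PDF that may be password protected."""
    # Imported here so unlock() can be tried without PyPDF2 installed.
    import PyPDF2  # third-party; assumes PyPDF2 1.x naming

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        if not unlock(reader, password):
            raise ValueError("wrong password for " + path)
        return reader.getNumPages()
```

With the file from this section, the call would be count_pages("executive_order_encrypted.pdf", "hoge1234").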

Troubleshooting: NotImplementedError is thrown when calling the decrypt function

The following error message may be thrown when working with an encrypted PDF file.

The error message means that PyPDF2 does not implement the algorithm that was used to encrypt the PDF file. If this happens, it is difficult to open the PDF file with PyPDF2 alone.

Decrypt with qpdf

Using qpdf is a quick solution. qpdf is a command-line tool for working with PDF files. We can download its installer for Windows from SourceForge, or install it on Mac with the brew install qpdf command.

The sample code uses qpdf to decrypt a PDF file.

The point is that Python executes the qpdf command as an OS command and saves the decrypted PDF file as a new PDF file without a password. Then, create a PdfFileReader instance to work with the new PDF file in PyPDF2.
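One way to run qpdf from Python is sketched below, using only the standard subprocess module. It assumes the qpdf executable is on the PATH and uses qpdf's --password and --decrypt options; build_qpdf_command and decrypt_with_qpdf are hypothetical helper names, and the output file name in the usage note is a placeholder.

```python
import subprocess


def build_qpdf_command(src_path, dst_path, password):
    """Build the qpdf command line that writes a decrypted copy of a PDF."""
    return ["qpdf", "--password=" + password, "--decrypt", src_path, dst_path]


def decrypt_with_qpdf(src_path, dst_path, password):
    """Run qpdf as an OS command and return the path of the decrypted copy."""
    # check=True raises CalledProcessError on a non-zero exit status.
    # Passing a list (no shell) avoids quoting problems in the password.
    subprocess.run(build_qpdf_command(src_path, dst_path, password), check=True)
    return dst_path  # this file can now be opened with PdfFileReader
```

For example, decrypt_with_qpdf("executive_order_encrypted.pdf", "executive_order_decrypted.pdf", "hoge1234") would write a password-free copy that PdfFileReader can open in the usual way.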

Conclusion

In this article, we saw how to

  • Open a PDF file with PdfFileReader in PyPDF2
  • Decrypt an encrypted PDF file with the decrypt function
  • Decrypt an encrypted PDF file with qpdf when NotImplementedError is raised