PDFPLUMBER: Extract Data You Need With This Super Easy To Use Python Module

in GEMS, 2 months ago


I love Python! It is easy to learn. It is fun to use. But most importantly, it saves time. Time is the most precious asset we all have. Often we spend our time doing repetitive work over and over again. Computers are really good at repetitive work, and they do it more efficiently. To tell computers what to do, we need a way to communicate with them. This is where programming languages come in. Learning any programming language is a big task. Python makes learning how to code easy and accessible to anybody willing to put in a little effort. With the right mindset, anybody can learn the basics of Python, use it for daily repetitive tasks, and ultimately save time.

Python has a big community and many libraries available to tackle various tasks. One of the Python modules I have been using lately is pdfplumber. As the name suggests, this module works with PDF files and helps with extracting relevant data.

PDF is one of the most widely used document formats. If your business, work, or school activities involve any documents, chances are you are familiar with PDF files. What if your daily activities involve reading through large amounts of PDF documents with many pages? Over time we can get more efficient and effective at processing these documents manually. But we still have physical limitations and end up spending countless hours on such repetitive tasks.

Using pdfplumber, we can tell the computer to do the repetitive parts of the task: identifying what is needed, extracting relevant data, and maybe even using this data for further analysis or storing it for future use and comparison. This is not the only module that helps with extracting data from PDF files. There are many more solutions out there. I found this one to be the easiest to understand and use. And it just works. If you know of any better solutions, feel free to let me know in the comments.

pdfplumber has great documentation with examples that demonstrate how it works. Please visit the pdfplumber GitHub page for the details.

The most important feature I have been using is extracting text from PDF files. This can be accomplished as follows:

import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    pages = pdf.pages
    first_page = pages[0]


pdf.pages in the code above returns a list of all pages, where each item is a page object. Properties like '.page_number', '.width', and '.height' return these self-explanatory values. '.chars' returns a list of all characters on the page, and each character comes with many useful properties of its own. This can be used for more complex data extraction. I will share more about '.chars' a bit later.
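To make these page properties a bit more concrete, here is a minimal sketch. The helper names summarize_page and summarize_pdf are my own and not part of pdfplumber; the only pdfplumber pieces used are pdfplumber.open, pdf.pages, and the page properties mentioned above. I import pdfplumber inside the function only so the small helper can be tried without a PDF at hand.

```python
def summarize_page(page):
    """Collect a few basic facts about one pdfplumber page object."""
    return {
        "page_number": page.page_number,  # page numbers start at 1
        "width": page.width,
        "height": page.height,
        "char_count": len(page.chars),    # .chars is a list, one entry per character
    }


def summarize_pdf(path):
    """Open the PDF at `path` and summarize every page (hypothetical helper)."""
    import pdfplumber  # imported here so summarize_page works without the library

    with pdfplumber.open(path) as pdf:
        return [summarize_page(page) for page in pdf.pages]
```

Calling summarize_pdf("path/to/file.pdf") should give back one small dictionary per page, which is a quick way to check that a file was read the way you expected.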

What makes pdfplumber awesome and super easy to use is its line by line text extraction. Take a look at the following code.

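import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    pages = pdf.pages
    for page in pages:
        # extract_text() can give None or an empty string for a page without text,
        # so fall back to "" before splitting into lines
        text = (page.extract_text() or "").split('\n')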

This code reads the PDF file and stores its pages in a pages variable. Then we iterate through the pages and extract the text of each page. We split the extracted text on newlines and get a list with one string per line of text. If we know what documents we are working with, we can identify certain text patterns to keep the text we need and throw away the rest.

Since the text lines are already in the order they appear in the document, we can build more useful code based on what text appears after certain text patterns. This line-by-line text extraction may seem very simple, but it is very powerful and saves me a lot of time.
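Here is a rough sketch of what such pattern matching could look like once you have the list of lines. Everything in it is an assumption for illustration: the helper names value_after_label and line_following are mine, and the sample lines are made up rather than taken from a real document. It only uses Python's built-in re module.

```python
import re


def value_after_label(lines, label):
    """Return the text after 'label:' on the first line starting with it, else None."""
    pattern = re.compile(rf"^{re.escape(label)}\s*:\s*(.+)$")
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            return match.group(1).strip()
    return None


def line_following(lines, marker):
    """Return the first non-empty line that comes after a line containing marker."""
    for index, line in enumerate(lines):
        if marker in line:
            for candidate in lines[index + 1:]:
                if candidate.strip():
                    return candidate.strip()
    return None


# Made-up lines, shaped like the output of extract_text().split('\n')
sample_lines = [
    "ACME Corp",
    "Invoice Number: 1042",
    "Billing Address",
    "",
    "12 Example Street",
    "Total: 99.50",
]

print(value_after_label(sample_lines, "Invoice Number"))
print(line_following(sample_lines, "Billing Address"))
```

The first helper covers values that sit on the same line as their label, and the second covers values that appear on the line after a known heading. Real documents will likely need patterns tuned to their own layout.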

If you want to build more complex algorithms for extracting the data you need, the .chars property of the page can be very helpful. It describes one character at a time and provides a lot of information about each character, like its value, font, size, x and y locations on the page, etc. To see the full list of .chars properties, visit the GitHub page mentioned above and/or experiment in your code.
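As a small illustration of what can be done with .chars, the sketch below groups characters by font and size, which can help separate headings from body text. It assumes each character is a dictionary with "text", "fontname", and "size" keys, as described in the pdfplumber documentation. The helper name text_by_font and the sample characters are my own and made up for the example.

```python
from collections import defaultdict


def text_by_font(chars):
    """Join character text per (fontname, size) pair, keeping the original order."""
    grouped = defaultdict(str)
    for char in chars:
        key = (char["fontname"], round(char["size"], 1))
        grouped[key] += char["text"]
    return dict(grouped)


# Made-up characters shaped like entries of page.chars (only the keys used here)
sample_chars = [
    {"text": "H", "fontname": "Helvetica-Bold", "size": 18.0},
    {"text": "i", "fontname": "Helvetica-Bold", "size": 18.0},
    {"text": "o", "fontname": "Helvetica", "size": 10.0},
    {"text": "k", "fontname": "Helvetica", "size": 10.0},
]

print(text_by_font(sample_chars))
```

With a real page you would pass page.chars instead of sample_chars. The same idea works with the position keys if you want to pick out text from a certain area of the page.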

This module can also extract various other objects in a PDF file like lines, rectangles, curves, annotations, and images. They all have properties similar to the char object. Moreover, pdfplumber can also help with table extraction and has a visual debugging feature.
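I have not used table extraction as much, so take the following as a sketch under assumptions rather than a recipe. It relies on page.extract_table(), which, as far as I know, returns the table as a list of rows (each row a list of cell strings, with None for empty cells) or None when no table is found. The helper names table_to_records and first_table_records are mine, and treating the first row as the header is an assumption that will not fit every document.

```python
def table_to_records(table):
    """Turn rows from extract_table() into dicts keyed by the first (header) row."""
    if not table:  # covers both None and an empty list
        return []
    header, *rows = table
    keys = [(cell or "").strip() for cell in header]
    return [
        {key: (cell or "").strip() for key, cell in zip(keys, row)}
        for row in rows
    ]


def first_table_records(path, page_index=0):
    """Extract one table from one page of the PDF at `path` (hypothetical helper)."""
    import pdfplumber  # imported here so table_to_records works without the library

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        return table_to_records(page.extract_table())


# Made-up rows shaped like the output of page.extract_table()
sample_table = [["Item", "Qty"], ["Pen", "2"], ["Ink ", None]]
print(table_to_records(sample_table))
```

If a table does not come out right, the visual debugging feature described in the pdfplumber documentation is the place to look next.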

If you work with PDF files a lot and use Python, give this module a try. I hope it can help you automate some tasks and save time as well. If you already use it, let me know about your experience with the module in the comments.


Learning Python is on my list once I am done with react and nodejs, so I'm saving this post for later (right now I am pretty sure I am just going to be confused and lonely) and I'll be back in around one month (hopefully since I am putting in like 8-10 hours a day to learn to code) to see what's this all about :D

I remember seeing that you were learning javascript. I always wanted to learn react too. That is awesome. When you get a chance you should look into threejs. Looking forward to seeing some cool apps from you.

I'm still there and damn, I'm loving every step of the way although I'm getting a little too obsessed with progress and some days I go on for too long without breaks, so I gotta pace myself.
I will definitely check threejs (never heard of it). I hope that at some point of early 2022 I am able to start developing, if so, you are definitely on the list of hivers I'll tell before release :D

It has always been a mess to copy text from a PDF file to Word or Notepad. It seems it will be easy to do with Python.

Cool. Looks like this ranks higher than PyPDF2.
Thanks for the information.

I was going to try pypdf2 next. Haven't tried it yet.

I kept saying to myself that I should start learning coding and especially Python. Everything looks so easy when it's explained by someone else but when it comes your turn, things are different. 🙄

You can do it.

Sounds like a handy tool. Thanks for sharing, very informative. Have the best day!

Thanks for the update on stuff like this, it's really awesome. We live in a world where technology has gone viral with the aim of making work easier and faster for us to handle, and it's also nice learning about Python programming.
