I love Python! It is easy to learn. It is fun to use. But most importantly, it saves time. Time is the most precious asset we all have. Often we spend our time doing repetitive work over and over again. Computers are really good at repetitive work, and they do it more efficiently than we can. To tell computers what to do, we need a way to communicate with them. This is where programming languages come in. Learning any programming language is a big task, but Python makes learning to code easy and accessible to anybody willing to put in a little effort. With the right mindset, anybody can learn the basics of Python, use it for daily repetitive tasks, and ultimately save time.
Python has a big community and a great many libraries available to tackle various tasks. One of the Python modules I have been using lately is pdfplumber. As the name suggests, this module works with PDF files and helps with extracting relevant data.
PDF is one of the most widely used document formats. If your business, work, or school activities involve any documents, chances are you are familiar with PDF files. What if your daily activities involve reading through large numbers of PDF documents with many, many pages? Over time we can get more efficient and effective at processing these documents manually. But we still have physical limitations and end up spending countless hours on such repetitive tasks.
With pdfplumber we can tell the computer to do the repetitive parts of the task: identifying what is needed, extracting the relevant data, and maybe even using this data for further analysis or storing it for future use and comparison. This is not the only module that helps with extracting data from PDF files. There are many more solutions out there. I found this one to be the easiest to understand and use. And it just works. If you know of any better solutions, feel free to let me know in the comments.
pdfplumber has great documentation, with examples that demonstrate how it works. Please visit the pdfplumber GitHub page for the details.
The feature I use most is extracting text from PDF files. The starting point is opening a file and looking at one of its pages, which can be done as follows:
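    import pdfplumber

    # Open the PDF; the with block closes the file when it ends
    with pdfplumber.open("path/to/file.pdf") as pdf:
        pages = pdf.pages        # list of page objects
        first_page = pages[0]    # index the list to get a single page
        print(first_page.page_number)
        print(first_page.width)
        print(first_page.height)
        print(len(first_page.chars))  # number of characters on the page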
pdf.pages in the code above returns a list of page objects, one for each page of the file. Properties like '.page_number', '.width', and '.height' return these self-explanatory values. '.chars' returns a list of all the characters on the page, and each character carries many useful properties of its own. This can be used for more complex data extraction. I will share more about '.chars' a bit later.
What makes pdfplumber awesome and super easy to use is its line-by-line text extraction. Take a look at the following code.
    import pdfplumber

    with pdfplumber.open("path/to/file.pdf") as pdf:
        pages = pdf.pages
        for page in pages:
            # Fall back to "" in case a page yields no text at all
            text = (page.extract_text() or "").split('\n')
            print(len(text))  # number of text lines on this page
This code reads the PDF file and stores its pages in the pages variable. Then we iterate through the pages and extract the text of each one. Splitting the extracted text on newlines gives us a list with one entry per line of text. If we know what documents we are working with, we can identify certain text patterns to keep the text we need and throw away the rest.
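Here is a rough sketch of how such pattern-based filtering might look. The helper names extract_lines and keep_matching, the sample lines, and the "Invoice No" / "Total" labels are all made up for illustration; your documents will need their own patterns.

```python
import re

def extract_lines(pdf_path):
    """Return every text line of the PDF, in document order."""
    import pdfplumber  # imported here so keep_matching() can be used without it

    lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Fall back to "" in case a page yields no text at all
            text = page.extract_text() or ""
            lines.extend(text.split("\n"))
    return lines

def keep_matching(lines, pattern):
    """Keep only the lines in which the regular expression is found."""
    regex = re.compile(pattern)
    return [line.strip() for line in lines if regex.search(line)]

# Hypothetical sample lines standing in for extract_lines("path/to/file.pdf")
sample = ["Invoice No: 1042", "Thank you for your business", "Total: 99.50", ""]
wanted = keep_matching(sample, r"^(Invoice No|Total):")
print(wanted)
```

With a real file, the sample list would be replaced by extract_lines("path/to/file.pdf"), and the pattern by whatever labels your own documents use.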
Since the text lines are already in the order in which they appear in the document, we can build more useful code based on what text appears after certain text patterns. The line-by-line text extraction of pdfplumber may seem very simple, but it is very powerful and saves me a lot of time.
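As a small illustration of using line order, the sketch below returns the lines that come right after a marker line. The function name lines_after and the sample lines are my own inventions for this example, not part of pdfplumber.

```python
def lines_after(lines, marker, count=1):
    """Return the `count` lines that follow the first line containing `marker`."""
    for index, line in enumerate(lines):
        if marker in line:
            return lines[index + 1 : index + 1 + count]
    return []  # marker not found

# Hypothetical lines, as they might come out of page.extract_text().split('\n')
sample = [
    "Customer",
    "Jane Doe",
    "Shipping address",
    "12 Example Street",
    "Springfield",
]
print(lines_after(sample, "Customer"))
print(lines_after(sample, "Shipping address", count=2))
```

This only works when the value reliably sits on the lines after its label, so it is worth checking a few real documents before relying on it.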
If you want to build more complex algorithms for extracting the data you need, the .chars property of the page can be very helpful. It describes the page one character at a time and provides a lot of information about each character, like its value, font, size, x and y locations on the page, etc. To see the full list of char properties, visit the GitHub page mentioned above and/or experiment in your code.
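To give a feel for what can be done with characters, here is a minimal sketch that keeps only the text set in a large font, which is one possible way to pick out headings. The function name text_larger_than and the sample dictionaries are hand-written for illustration; "text", "fontname", "size", "x0", and "top" are among the keys pdfplumber reports for each character.

```python
def text_larger_than(chars, min_size):
    """Join the characters whose font size is at least `min_size`."""
    return "".join(c["text"] for c in chars if c["size"] >= min_size)

# Hand-written stand-ins for the dictionaries found in page.chars
sample_chars = [
    {"text": "H", "fontname": "Helvetica-Bold", "size": 18.0, "x0": 72.0, "top": 70.0},
    {"text": "i", "fontname": "Helvetica-Bold", "size": 18.0, "x0": 85.0, "top": 70.0},
    {"text": "x", "fontname": "Helvetica", "size": 10.0, "x0": 72.0, "top": 100.0},
]
print(text_larger_than(sample_chars, 12))
```

With a real file you would pass page.chars instead of sample_chars. The size threshold of 12 is an arbitrary choice and depends entirely on the document.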
This module can also extract various other objects in a PDF file, like lines, rectangles, curves, annotations, and images. They all have properties similar to those of the char object. Moreover, pdfplumber can also help with table extraction and has a visual debugging feature.
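For table extraction, the page object has an extract_table() method that returns the largest table it finds as a list of rows, with None for empty cells. The sketch below shows one way to turn such a table into a list of dictionaries. It assumes the first row is a header, and the names table_to_records and first_table_records and the sample table are made up for this example.

```python
def table_to_records(table):
    """Turn a table (list of rows, first row = header) into a list of dicts."""
    if not table:
        return []  # covers both None (no table found) and an empty table
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Empty cells come back as None, so replace them with ""
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records

def first_table_records(pdf_path, page_index=0):
    """Extract the largest table on one page and convert it to records."""
    import pdfplumber  # imported here so table_to_records() can be used without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return table_to_records(page.extract_table())

# Hypothetical table in the shape that page.extract_table() returns
sample_table = [["Item", "Qty"], ["Pen", "2"], ["Notebook", None]]
print(table_to_records(sample_table))
```

How well this works depends on how the table is drawn in the PDF, so the result is worth checking against the original page.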
If you work with PDF files a lot and use Python, give this module a try. I hope it can help you automate some tasks and save time as well. If you already use it, let me know about your experience with the module in the comments.