PDFPLUMBER - PDF Data Extraction Library For Python Gets A Big Update

almost 3 years ago

This is going to be a short post and an update on my previous post about PDFPLUMBER. PDFPLUMBER is a super useful Python library for extracting data from pdf files. If you have used it before, you will like the new changes. If you haven't used it before and you work with a lot of pdf files, you may want to consider adding it to your toolbox. If you want a more detailed review of this library, feel free to read my previous post about it.

I work with dozens of pdf files, each with dozens or hundreds of pages, on a daily basis. I had to write a script to extract the data I need and create a summary. PDFPLUMBER did the job just fine. Today, I wanted to revise the code and came across a very pleasant surprise. The library was updated at the end of last month and now has a really cool feature: a layout option when extracting text. In the previous version, we could only extract text line by line, and the extracted text dropped the extra whitespace, so it didn't retain the layout of the page.

Now, when extracting data, we can pass the argument layout=True to the .extract_text() method, and the entire page of text is extracted with the original layout, as it appears in the pdf document. Pretty cool. This makes it easier to identify what each piece of text means and what its purpose is. Without this, we would have to resort to writing regex patterns to pick out specific text. I absolutely dislike using regex. lol

If any of this doesn't make sense, please read my previous post about PDFPLUMBER or go to the PDFPLUMBER GitHub page first.

Extracting text data with layout is a new feature, and is not guaranteed to be 100% accurate. I actually did come across one error it made. But for the most part it works great.

page.extract_text().split('\n') extracts text the old way, and splitting on line breaks gives us a list with one item per line of text. We can do the same with the layout option like this:
page.extract_text(layout=True).split('\n')
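
To make the two calls easier to compare, here is a minimal sketch that wraps them in one helper. The function names page_lines and pdf_lines are my own and not part of PDFPLUMBER, and the path and page number are whatever you pass in. It assumes a version of the library that supports the layout argument.

```python
def page_lines(page, layout=False):
    """Return the text of one pdfplumber page as a list of lines.

    `page` is expected to be a pdfplumber page object (anything with an
    extract_text(layout=...) method works).
    """
    # Fall back to an empty string in case a page yields no text at all.
    text = page.extract_text(layout=layout) or ""
    return text.split("\n")


def pdf_lines(path, page_number=0, layout=False):
    """Open a pdf and return the lines of one page (0-based page index)."""
    import pdfplumber  # imported here so page_lines() works without it

    with pdfplumber.open(path) as pdf:
        return page_lines(pdf.pages[page_number], layout=layout)
```

Calling pdf_lines("report.pdf") and pdf_lines("report.pdf", layout=True) on the same file (the file name is just a placeholder) gives the two lists discussed in the rest of this post.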

The first problem we will see is that the two ways do not return the same number of lines, even though we are using the same pdf file. The reason is that the layout version adds a bunch of empty lines. To fix this, we have to iterate through the resulting list and remove the empty items.
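
A small sketch of that cleanup step is below. The name drop_blank_lines is my own, and I am assuming that a line containing only spaces should count as empty too, since the layout version pads lines with spaces.

```python
def drop_blank_lines(lines):
    """Return only the lines that contain something other than whitespace."""
    # line.strip() is an empty string (falsy) for "" and for "   ".
    return [line for line in lines if line.strip()]


layout_lines = ["  Invoice      Total", "", "      ", "  Widget       10.00"]
print(drop_blank_lines(layout_lines))
```

The leading spaces of the lines that are kept stay untouched, so the layout is not lost.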

Once that is done, the lists with and without layout will have the same number of lines. The reason we need the two lists to line up is so we can fix the cases where the layout version makes an error. One error I saw was a word being attached to the end of the previous line. This only happened once, but as a result two lines of text were not accurate, or at least not what I was expecting.

So, the next step is to iterate through both lists, with and without layout, and identify where the lines of text don't match. Once these lines are identified, we can count the spaces added in front of the layout version of the line and add the same number of spaces in front of the matching line from the list without layout. Finally, this new line replaces the one with the error. It may seem confusing at first, but it actually worked for me. If you come across a similar problem, feel free to ask. I will be glad to help.
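
Here is a rough sketch of that repair step, not my exact script. The name repair_layout_lines is made up for this post. It assumes both lists already have the same number of lines (empty ones removed), and it treats two lines as matching when they contain the same words in the same order, ignoring spacing. That matching rule is an assumption you may need to adjust for your files.

```python
def repair_layout_lines(plain_lines, layout_lines):
    """Replace layout lines whose words differ from the plain version.

    A replaced line keeps the leading spaces of the layout line, followed
    by the text of the plain line.
    """
    if len(plain_lines) != len(layout_lines):
        raise ValueError("both lists must have the same number of lines")

    repaired = []
    for plain, laid_out in zip(plain_lines, layout_lines):
        if plain.split() == laid_out.split():
            # Same words in the same order: keep the layout version.
            repaired.append(laid_out)
        else:
            # Count the spaces in front of the layout line and reuse them.
            indent = len(laid_out) - len(laid_out.lstrip(" "))
            repaired.append(" " * indent + plain.strip())
    return repaired


plain = ["Name Total", "Alice 10", "Bob 20"]
layout = ["  Name     Total Alice", "    10", "  Bob      20"]
for line in repair_layout_lines(plain, layout):
    print(line)
```

In this made-up example the word "Alice" was attached to the end of the first layout line, so the first two lines get replaced and the third is kept as is. The spacing inside a replaced line is lost, only the indentation in front of it is preserved.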

I am really happy about this update and can see a lot of benefits. I spent the entire day revising my previous code to make use of this new change. I will be using this library more often now. Best tool ever!

pdf dev python coding vyb proofofbrain stem