Extract attachments from email with Python

Looking for data to start your new EDA (Exploratory Data Analysis) project? Or maybe just looking to automate a task that is stealing a lot of your precious time?

Extracting data from your email is a good way to collect data or to automate boring everyday tasks.


In this post I will explain how I used Python, and more specifically a library called pywin32, to help with a human resources task.

The package only works on Windows; trying to install it on another operating system ends in an error. Make sure you have a virtual environment on your Windows PC and run the following command:

pip install pywin32

Start by creating an object that gives you access to your email (in this case Outlook):

import win32com.client

outlook = win32com.client.Dispatch('outlook.application').GetNamespace("MAPI")
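
Since pywin32 only works on Windows, a script that might also be started on another system can check the platform before connecting. The sketch below is one possible way to do that; the helper names running_on_windows and connect_to_outlook are made up for this example and are not part of pywin32.

```python
import sys


def running_on_windows():
    """Return True when the interpreter runs on Windows."""
    return sys.platform.startswith("win")


def connect_to_outlook():
    """Return the Outlook MAPI namespace, or raise a clear error elsewhere."""
    if not running_on_windows():
        raise RuntimeError("pywin32 and Outlook are only available on Windows")
    # Imported lazily so this module still loads on other systems
    import win32com.client

    return win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
```

On a Windows PC with Outlook installed, outlook = connect_to_outlook() should give the same object as the one-liner above.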

If you have several accounts, you can use the following function to list them and pick the one you need:

#check which outlook accounts there are
def get_email_accounts():
    accounts = []
    for account in outlook.Accounts:
        accounts.append(account.DisplayName) #name of the account as shown in Outlook
    return accounts

To check all the main folders you can access through this object, use the following function:

#iterate to see main folders
def iterate_folder(iter = 50):
    for i in range(iter):
        try:
            folder = outlook.GetDefaultFolder(i)
            print(i, folder)
        except Exception:
            pass #no default folder has this number

If you created extra main folders, the function above can't detect them. However, subfolders inside the inbox, or inside any other predefined folder, can be accessed with the following command:

folder = outlook.GetDefaultFolder(6).Folders(<subfolder>)

You might be wondering why I chose '6' in the command above. The number 6 is Outlook's default folder number for the inbox. From there I accessed a subfolder inside the inbox, the one holding the files I wanted to extract.
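
To avoid bare numbers like 6 in the script, the folder numbers can be given names. The sketch below lists a few values from Outlook's OlDefaultFolders enumeration; the class DefaultFolder and the helper default_folder_number are hypothetical names chosen for this example.

```python
from enum import IntEnum


class DefaultFolder(IntEnum):
    """A few of Outlook's default folder numbers (OlDefaultFolders)."""
    DELETED_ITEMS = 3
    OUTBOX = 4
    SENT_MAIL = 5
    INBOX = 6
    CALENDAR = 9
    CONTACTS = 10
    DRAFTS = 16


def default_folder_number(name):
    """Translate a name such as 'inbox' or 'sent mail' into its number."""
    return int(DefaultFolder[name.strip().upper().replace(" ", "_")])
```

With this in place, outlook.GetDefaultFolder(int(DefaultFolder.INBOX)) reads more clearly than outlook.GetDefaultFolder(6).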

To grab the messages inside the subfolder use the following commands:

#the last message
messages = folder.Items
message_last = messages.GetLast()

#the message before it
message_previous = messages.GetPrevious()
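
The GetLast and GetPrevious calls can also be wrapped in a small generator, so the messages can be used in a regular for loop. This is only a sketch: iter_newest_first is a made-up helper name, and it assumes GetPrevious returns None once there are no more messages.

```python
def iter_newest_first(items, limit=None):
    """Yield items from last to first using GetLast() and GetPrevious()."""
    item = items.GetLast()
    count = 0
    # Stop when the collection is exhausted or the optional limit is reached
    while item is not None and (limit is None or count < limit):
        yield item
        count += 1
        item = items.GetPrevious()
```

For example, for message in iter_newest_first(messages, limit=10) would visit at most the ten last messages of the folder.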

I will explain further down how to loop over all the messages inside the subfolder. First I will introduce the function below, which looks for .pdf files inside a specific message, saves them to a directory and keeps track of them in a list.

import os

def get_attachs_from_message(message, output_dir, index, iter = 4):
    attachments = message.Attachments #object that contains the attachments
    attachments_pdf = [] #paths of the pdf files that were saved
    #attachments are numbered from 1; never go past the real attachment count
    for i in range(1, min(iter, attachments.Count + 1)):
        attach = attachments.Item(i) # object that contains a single attachment
        if attach.FileName.lower().endswith('.pdf'): #checks for pdf files
            path = os.path.join(output_dir, f"{index}_{attach.FileName}")
            attach.SaveAsFile(path)
            attachments_pdf.append(path)
    return attachments_pdf

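Finally, to loop over all the messages in the subfolder (output_dir is the existing directory where you want the pdfs saved):

list_of_lists = []
message = messages.GetLast() # start from the last message
for i in range(0, 100): # choose how many messages you want to parse
    if message is None: # the folder has fewer messages than requested
        break
    index = f"{i:02d}" # zero-padded prefix that keeps the file names unique
    list_of_lists.append(get_attachs_from_message(message, output_dir, index))
    message = messages.GetPrevious() # gets previous email message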


To wrap up, the last two code snippets extract the pdf files from the subfolder and save them into a directory. Afterwards I used PyPDF2 to extract important data from the saved pdfs and saved it in a .csv file.
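
That last step could look roughly like the sketch below. It is not the script I used: what counts as important data depends on your pdfs, so this version only stores the raw text of each file. The function names pdfs_to_csv and extract_text_with_pypdf2 are made up for the example, and it assumes a PyPDF2 version that provides PdfReader.

```python
import csv
import os


def extract_text_with_pypdf2(pdf_path):
    """Return the text of every page in a pdf, joined by newlines."""
    from PyPDF2 import PdfReader  # imported here so the rest works without it

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def pdfs_to_csv(pdf_dir, csv_path, extract_text=extract_text_with_pypdf2):
    """Write one csv row (file name, text) per pdf found in pdf_dir."""
    names = sorted(n for n in os.listdir(pdf_dir) if n.lower().endswith(".pdf"))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "text"])
        for name in names:
            writer.writerow([name, extract_text(os.path.join(pdf_dir, name))])
    return len(names)
```

Passing your own extract_text function lets you replace the raw text with whatever fields you need from each pdf.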

I hope the scripts provided are helpful for your own needs.

Email is definitely an amazing source of data! 😎