First Steps on Web Scraping with Python


I decided to write a simple program to gather PhD positions in Data Science, Data Analytics, Machine Learning and Computer Vision from the following website:

In the end, the code proved helpful and saved me time when looking for positions that match predefined characteristics. However, I would like to reduce its computational cost and make it faster, so please feel free to suggest modifications to the code in the comments section.

I started by importing pandas (to manage Data Frames), BeautifulSoup (for web scraping) and urllib.request (an extensible library for opening URLs):

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

Next, I needed to build the URL for each page:

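URL_MAIN = ''  # placeholder: set this to the base URL of the listings site
n_pages = 200
URLS = []
for root in range(n_pages):
    index = '' + str(root + 1)  # page suffix for pages 1..n_pages
    URLS.append(URL_MAIN + index)  # assumed: page URL = base URL + suffix

URLS.insert(0, URL_MAIN)  # the first page is the base URL itself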

With n_pages, I can decide how many pages to iterate over; the more pages I choose, the greater the computational cost. This loop produces a list with all the URLs to be parsed.
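
The same list can also be built in one step with a list comprehension. The sketch below is only an illustration: the helper name `build_page_urls`, the address `https://example.org/jobs` and the `?page=` suffix are all made-up placeholders, since the real address and paging scheme depend on the site being scraped.

```python
def build_page_urls(base_url, n_pages, suffix="?page="):
    """Return the base URL followed by one URL per numbered page."""
    # base_url and suffix are hypothetical placeholders, not the real site's scheme.
    pages = [f"{base_url}{suffix}{page}" for page in range(1, n_pages + 1)]
    return [base_url] + pages


urls = build_page_urls("https://example.org/jobs", 3)
print(urls)
```

Wrapping the logic in a function makes it easy to try a small n_pages first and only scale up once the parsing works.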

The best way I found to visualize and analyze the data was to build a Data Frame with the fields from each listing that I considered most important: ['Date', 'Topics', 'Deadlines', 'Fields', 'Locations', 'Institutes'].
The first lists to capture from the URLs are the dates and the topics/positions:

foo_topics = []
foo_dates = []
for URL in URLS:
    req = urllib.request.Request(URL, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"})
    page = urllib.request.urlopen(req)
    soup = BeautifulSoup(page, "html.parser")
    topics_html = soup.find_all('h2')
    dates_html = soup.find_all('div', class_="value iblock")
    topics = []
    dates = []
    for i in range(len(topics_html)):
        topic = topics_html[i].text # returns only the text inside the HTML component
        topics.append(topic)
    for i in range(len(dates_html)):
        date = dates_html[i].text
        dates.append(date)
    topics = topics[2:] # First 2 elements are not job positions
    foo_topics.append(topics)
    foo_dates.append(dates)
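
To see what the two find_all calls return without hitting the network, here is a minimal sketch that runs them on a small inline HTML string. The markup is made up to mimic the structure described in this post (two leading h2 headings that are not job positions, and dates inside div elements with the class "value iblock"); the real page will look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
SAMPLE_HTML = """
<h2>Site header</h2>
<h2>Search results</h2>
<h2>PhD position in Machine Learning</h2>
<div class="value iblock">2021-03-01</div>
<h2>Postdoc in Computer Vision</h2>
<div class="value iblock">2021-03-02</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# .text drops the tags and keeps only the enclosed text.
topics = [h2.text for h2 in soup.find_all("h2")][2:]  # skip the 2 non-job headings
dates = [div.text for div in soup.find_all("div", class_="value iblock")]

print(topics)
print(dates)
```

As far as I know, passing a multi-word string to class_ matches the exact value of the class attribute, so the order of the class names has to be the same as in the page.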

The loop above iterates over every page and collects the two lists mentioned above. Dates are located by their HTML class, and topics by their HTML heading tag (h2). The same is done for the other columns of the Data Frame:

foo_deadlines = []
foo_fields = []
foo_locations = []
foo_institutes = []
for URL in URLS:
    req = urllib.request.Request(URL, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"})
    page = urllib.request.urlopen(req)
    soup = BeautifulSoup(page, "html.parser")
    classes_html = soup.find_all('div', class_='col-xs-12 col-sm-7 col-md-6 value')
    deadlines = []
    fields = []
    locations = []
    institutes = []
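    # Each listing contributes 4 consecutive divs: deadline, field, location, institute
    for i in range(len(classes_html) // 4):
        deadlines.append(classes_html[i*4].text)
        fields.append(classes_html[i*4 + 1].text)
        locations.append(classes_html[i*4 + 2].text)
        institutes.append(classes_html[i*4 + 3].text)
    foo_deadlines.append(deadlines)
    foo_fields.append(fields)
    foo_locations.append(locations)
    foo_institutes.append(institutes)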
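
Note that this second loop downloads every page again, so each URL is requested twice. One possible speed-up is to fetch each page only once and to run the downloads in parallel threads. The sketch below is one way to do that under some assumptions: `fetch`, `fetch_all`, the shortened User-Agent header and the default of 8 workers are all illustrative choices, not part of the original code.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened placeholder header


def fetch(url):
    """Download one page and return its raw bytes."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as page:
        return page.read()


def fetch_all(urls, fetch_fn=fetch, max_workers=8):
    """Fetch every URL once, in parallel threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_fn, urls))
```

With this, `pages = fetch_all(URLS)` returns one HTML document per URL, and each can be passed to BeautifulSoup a single time to extract the topics, dates, deadlines, fields, locations and institutes together. It is probably wise to keep max_workers modest so the site is not flooded with requests.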

Each of these "foo" lists is a list of lists (one inner list per page), so they needed to be flattened into single lists:

all_topics = []
all_dates = []
all_deadlines = []
all_fields = []
all_locations = []
all_institutes = []
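for arr in foo_topics:
    for i in arr:
        all_topics.append(i)
for arr in foo_dates:
    for i in arr:
        all_dates.append(i)
for arr in foo_deadlines:
    for i in arr:
        all_deadlines.append(i)
for arr in foo_fields:
    for i in arr:
        all_fields.append(i)
for arr in foo_locations:
    for i in arr:
        all_locations.append(i)
for arr in foo_institutes:
    for i in arr:
        all_institutes.append(i)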
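
The six nested loops all do the same thing, so they could be replaced by one small helper. A minimal sketch using itertools.chain from the standard library; the name `flatten` and the sample data are made up for illustration:

```python
from itertools import chain


def flatten(list_of_lists):
    """Concatenate the per-page lists into one flat list."""
    return list(chain.from_iterable(list_of_lists))


# Made-up sample data standing in for foo_topics.
sample_foo_topics = [["PhD A", "PhD B"], ["Postdoc C"], []]
print(flatten(sample_foo_topics))
```

Applied to the real data this would read `all_topics = flatten(foo_topics)`, and likewise for the other five lists.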

With the proper lists obtained, I was now ready to make the Data Frame:

df = pd.DataFrame(list(zip(
    all_dates, all_topics, all_deadlines, all_fields, all_locations, all_institutes)),
    columns=['Date', 'Topics', 'Deadlines', 'Fields', 'Locations', 'Institutes'])
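
One thing to watch for here: zip stops at the shortest of its inputs, so if one list is missing entries (for example because a listing lacks a field), rows are silently dropped and the columns can end up misaligned. A small check before building the Data Frame can catch this. The helper `check_equal_lengths` and the sample values below are hypothetical, just a sketch of the idea:

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Made-up sample data; with the real lists, pass all six columns.
sample_dates = ["2021-03-01", "2021-03-02"]
sample_topics = ["PhD A", "PhD B"]
print(check_equal_lengths(Date=sample_dates, Topics=sample_topics))
```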

With the Data Frame in place, it became much easier to filter the positions by keyword. For this, I decided to create a function, so that the code can be reused to search for different subjects inside the Data Frame:

def get_df(df, subject):
    data = df['Topics'].str.find(subject)
    data_index = data[data != -1]  # keep rows where the subject was found
    data_index = data_index.index.values.tolist()
    result = df.loc[data_index, :]  # these are index labels, so select with .loc
    return result
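
An alternative worth considering is a boolean mask built with pandas' Series.str.contains, which avoids the round trip through index lists. The sketch below is not a drop-in copy of get_df: it is case-insensitive, it ignores missing topics, and it resets the index of the result. The name `filter_topics` and the sample rows are made up for illustration.

```python
import pandas as pd


def filter_topics(df, subject):
    """Return the rows whose 'Topics' text contains subject (case-insensitive)."""
    # regex=False treats subject as plain text; na=False skips missing topics.
    mask = df["Topics"].str.contains(subject, case=False, regex=False, na=False)
    return df[mask].reset_index(drop=True)


# Made-up rows for illustration only.
sample = pd.DataFrame({"Topics": ["PhD in Machine Learning",
                                  "Postdoc in Optics",
                                  "PhD position, machine learning",
                                  None]})
print(filter_topics(sample, "Machine Learning"))
```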

Finally, I used the function above to obtain Data Frames for all PhD positions, and for PhD positions in Machine Learning, Computer Vision, and Data Science or Data Analysis (both matched by 'Data'):

df_phd = get_df(df, 'PhD')
df_phd = df_phd.reset_index(drop=True)
df_ml = get_df(df_phd, 'Machine Learning')
df_cv = get_df(df_phd, 'Computer Vision')
df_data = get_df(df_phd, 'Data')

See below the output for df_ml['Topics']:

PhD position on Hybrid Data-Assimilation and Machine Learning                                     
PhD position (m/f/d) Correlative Microscopy and Machine Learning                                  
PhD Position in Advanced Seismic Imaging and Machine Learning for Earthquake Site Response Studies
PhD Project: Efficient Machine Learning Algorithms for Large Scale Data-center Monitoring         
IC2021_08 PhD Fellowship in Trustworthy Machine Learning                

The full code is available in this post, but if you want to see more of my work, feel free to visit me on GitHub:

🤖 🤖

