Web Scraping & Extracting The Bundesliga German Football Table In R Programming
Hi there. In this post, I have a bit of a technical guide for web scraping and extracting the Bundesliga German Football table in the R programming language.
Full R code can be found here for those interested.
I use RStudio on top of the R software for programming in R and for writing up posts offline.
For the libraries, I use dplyr, tidyr and rvest. dplyr and tidyr are comparable to pandas from Python, and rvest is comparable to BeautifulSoup from Python.
Installing R packages requires install.packages(). To install the rvest package in R, use install.packages("rvest"). To load a package in R, use library(). Then use read_html() to load the link of the website you want to extract data from.
# Reference: https://stackoverflow.com/questions/45450981/rvest-scrape-2-classes-in-1-tag
library(dplyr)
library(tidyr)
library(rvest)

bundesliga <- read_html("https://www.bundesliga.com/en/bundesliga/table")
The item of interest here is the table on the Bundesliga table page. In Brave Browser, right-click on the page to display a pop-up menu, then click Inspect at the bottom to see the HTML code. Make sure the Elements tab is selected. The Bundesliga league table is in the tbody tag.
In R and rvest, the tbody table can be extracted with html_elements(). I use bundesliga %>% html_elements('tbody'), which operates the same as html_elements(bundesliga, 'tbody'). The pipe operator %>% will be used throughout here.
page <- bundesliga %>% html_elements('tbody')
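To make the equivalence of the piped and nested calls concrete, here is a minimal sketch on a small inline page. The markup below is a made-up stand-in for the real site (an assumption, not the actual Bundesliga HTML), built with rvest's minimal_html() so that it runs without a network connection.

```r
library(rvest)

# Toy page standing in for the real site (hypothetical markup).
toy <- minimal_html('
  <table>
    <tbody>
      <tr><td class="rank">1</td><td>Team A</td></tr>
      <tr><td class="rank">2</td><td>Team B</td></tr>
    </tbody>
  </table>')

# Piped form and nested form of the same call.
piped  <- toy %>% html_elements("tbody")
nested <- html_elements(toy, "tbody")

# Both select the single <tbody> node and yield the same text.
length(piped)
identical(html_text2(piped), html_text2(nested))
```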
Using html_nodes From rvest
I have found html_nodes() to be super helpful here for extracting the columns of the league table. In rvest 1.0.0 and later, html_nodes() is superseded by html_elements(), and the two behave the same way in this post. Inspect is used heavily to find the classes for the Bundesliga teams, ranks, league points, wins, draws, losses, goals and goal difference.
For extracting the teams in the form of FC Bayern München, Borussia Dortmund, Bayer 04 Leverkusen, etc., use the class d-none d-lg-inline from the span tag in each table row (tr). Here is the code chunk for extracting teams.
# Teams are in the class 'd-none d-lg-inline':
teams <- page %>%
  html_nodes("[class='d-none d-lg-inline']") %>%
  html_text2()
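The selector [class='d-none d-lg-inline'] is an attribute selector, so it only matches elements whose class attribute is exactly that string. A rough sketch of how this differs from the chained class selector .d-none.d-lg-inline is below. The inline markup and the extra class are made up for illustration and are not taken from the real page.

```r
library(rvest)

# Hypothetical markup: two spans share the classes, one has an additional class.
toy <- minimal_html('
  <table><tbody>
    <tr><td>
      <span class="d-none d-lg-inline">Borussia Dortmund</span>
      <span class="d-none d-lg-inline extra">BVB</span>
    </td></tr>
  </tbody></table>')

# Attribute selector: the class attribute must equal the full string.
exact <- toy %>%
  html_elements("[class='d-none d-lg-inline']") %>%
  html_text2()

# Class selector: any element carrying both classes, whatever else it has.
both <- toy %>%
  html_elements(".d-none.d-lg-inline") %>%
  html_text2()

exact
both
```

If the site ever adds another class to those span tags, the exact-match form would stop matching, while the class selector would keep working but might also pick up unwanted elements.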
Rank can be obtained from the td tag inside the table row (tr) tag for each team. The class is rank.
# Rank & convert into integer:
team_rank <- page %>%
  html_nodes("[class='rank']") %>%
  html_text2() %>%
  readr::parse_integer()
For the rest you can use the same code chunk and change the class. Full code can be found here.
- Matches Played: Class is
- Points: pts is the class.
- Wins: the class is d-none d-lg-table-cell wins.
- Draws: d-none d-lg-table-cell wins is the class listed for draws. This is the same string as the wins class, so confirm the draws class with Inspect.
- Losses: the class is d-none d-lg-table-cell looses. The spelling mistake is in their HTML.
- Goals: the class is d-none d-md-table-cell goals.
- Goal Difference: difference is the class.
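Since each column uses the same chunk with a different class, the repetition can be wrapped in a small function. The sketch below is only an illustration: extract_column() is a hypothetical helper name, base R's as.integer() is used in place of readr::parse_integer(), and the toy markup is made up rather than copied from the site.

```r
library(rvest)

# Hypothetical helper: text of every node whose class attribute
# exactly equals `class_name`, optionally converted to integer.
extract_column <- function(page, class_name, as_integer = FALSE) {
  selector <- sprintf("[class='%s']", class_name)
  values <- page %>% html_elements(selector) %>% html_text2()
  if (as_integer) as.integer(values) else values
}

# Toy table body with made-up numbers.
toy_page <- minimal_html('
  <table><tbody>
    <tr><td class="rank">1</td><td class="pts">30</td><td class="difference">12</td></tr>
    <tr><td class="rank">2</td><td class="pts">27</td><td class="difference">-3</td></tr>
  </tbody></table>') %>%
  html_elements("tbody")

points    <- extract_column(toy_page, "pts", as_integer = TRUE)
goal_diff <- extract_column(toy_page, "difference", as_integer = TRUE)
```

On the real page, the same call would take the class strings from the list above, for example extract_column(page, "d-none d-lg-table-cell looses", as_integer = TRUE) for losses.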
Creating A Dataframe
Once the parts are extracted, you can create a dataframe in R.
### Create Bundesliga dataframe:
bundes_df <- data.frame(Rank = team_rank,
                        Team = teams,
                        Points = points,
                        Played = matches,
                        Wins = wins,
                        Draws = draws,
                        Losses = losses,
                        Goals = goals,
                        Goal_Difference = goal_diff)
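data.frame() needs the column vectors to line up, so if one selector misses some rows the call can fail or recycle values. One possible guard, sketched here with made-up toy vectors in place of the scraped ones, is to compare the lengths first. The names columns, column_lengths and toy_df are illustrative only.

```r
# Toy vectors standing in for the scraped columns (made-up values).
team_rank <- c(1L, 2L, 3L)
teams     <- c("Team A", "Team B", "Team C")
points    <- c(30L, 27L, 25L)

columns <- list(Rank = team_rank, Team = teams, Points = points)
column_lengths <- lengths(columns)

# Every column should have one entry per club; stop early otherwise.
if (length(unique(column_lengths)) != 1L) {
  stop("Column lengths differ: ",
       paste(names(columns), column_lengths, sep = "=", collapse = ", "))
}

toy_df <- as.data.frame(columns, stringsAsFactors = FALSE)
```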
The Goals column combines Goals For and Goals Against in the format GF:GA. This can be separated with separate() from R's tidyr package.
## Goals Column Separate Into Goals For and Goals Against:
bundes_df <- bundes_df %>%
  separate(Goals, c("Goals For", "Goals Against"))
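By default separate() splits on any run of non-alphanumeric characters and leaves the new columns as character. As a small sketch on made-up data, the separator can be stated explicitly with sep = ":" and the pieces converted to integers with convert = TRUE. The underscore column names here are my own choice for the example, not the names used above.

```r
library(tidyr)

# Toy data in the GF:GA format (made-up values).
toy_df <- data.frame(Team  = c("Team A", "Team B"),
                     Goals = c("45:12", "38:20"),
                     stringsAsFactors = FALSE)

toy_df <- separate(toy_df, Goals,
                   into = c("Goals_For", "Goals_Against"),
                   sep = ":",       # split only on the colon
                   convert = TRUE)  # turn the pieces into integers

str(toy_df)
```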
Saving Into A .csv File
Saving a dataframe in R into a .csv file is quite easy. You just need write.csv() with a dataframe and a file name.
I save the current Bundesliga table into a .csv file along with the current date. R's paste() function allows for concatenating/combining strings together.
## Save Bundesliga Table Into A .csv file.
write.csv(bundes_df, paste("Bundesliga_", Sys.Date(), ".csv", sep = ""))
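As a quick check of the saving step, here is a sketch that writes a toy dataframe to a temporary folder and reads it back. It assumes you want a .csv extension on the file name and no row-name column in the output (row.names = FALSE). The toy data and the use of tempdir() are for illustration only.

```r
# Toy dataframe standing in for bundes_df (made-up values).
toy_df <- data.frame(Rank = 1:2,
                     Team = c("Team A", "Team B"),
                     stringsAsFactors = FALSE)

# Build a dated file name such as Bundesliga_<YYYY-MM-DD>.csv
file_name <- paste("Bundesliga_", Sys.Date(), ".csv", sep = "")
file_path <- file.path(tempdir(), file_name)

# row.names = FALSE keeps R's row numbers out of the file.
write.csv(toy_df, file_path, row.names = FALSE)

# Read the file back to confirm what was written.
round_trip <- read.csv(file_path, stringsAsFactors = FALSE)
round_trip
```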