A quick look at the pars, lengths, and handicap indices of golf courses in Donegal and surrounds. In this first part we’ll look at how to get hold of the data by web scraping; a second post will analyse the numbers.
Amongst other things, Donegal is a haven for golfers. And while my favourite knock is around the home course in Dunfanaghy, I thought it worthwhile to explore the course cards for the different clubs in the county.
Getting the data from each golf club would be laborious, but fortunately the website at Golfpass makes this a lot easier. As long as you can find the landing page for the golf courses you’re interested in (for Donegal it’s the `url_donegal` address in the code below), you can strip out all the links and navigate to the courses, and their scorecards, that way. The libraries we need are `tidyverse` (of course) and `rvest` for the scraping. We end up with a lot of links, but the ones we care about are recognisable because they contain the string `/courses/` and do not contain `#write-review`.
library(tidyverse)
library(rvest)

url_donegal <- "https://www.golfpass.com/travel-advisor/course-directory/8812-county-donegal/"
w <- read_html(url_donegal)
scorecard_urls <- html_attr(html_nodes(w, "a"), "href") |>
  as_tibble() |>
  filter(str_detect(value, "/courses/"),
         !str_detect(value, "#write-review")) |>
  distinct()
The first few entries are shown below; there are 27 in total.
value |
---|
https://www.golfpass.com/travel-advisor/courses/19794-ballybofey-and-stranorlar-golf-club |
https://www.golfpass.com/travel-advisor/courses/19705-ballyliffin-golf-club-glashedy |
https://www.golfpass.com/travel-advisor/courses/19706-ballyliffin-golf-club-old |
https://www.golfpass.com/travel-advisor/courses/39809-ballyliffin-golf-club-the-pollan-links |
https://www.golfpass.com/travel-advisor/courses/25762-buncrana-golf-club |
https://www.golfpass.com/travel-advisor/courses/19707-bundoran-golf-club |
Navigating through this list of URLs and stripping out the details for each hole on each course is a job for `purrr`. I wrote a function that works for one course at a time and then ran it through `map`. Twice. Then I put all the courses together using `data.table::rbindlist`.
Some things to watch out for:

- Some courses had an entry in Golfpass but no scorecard, hence the use of `safely` so that they didn’t crash the first call to `map`.
- The regex set-up in the function is fundamentally geared for 18-hole courses but then needed to be adapted to cater for their 9-hole brethren. Dropping NAs and then checking the length of each tibble in the list did this.
- Having distances measured in either meters or yards was a real pain, especially for courses that gave an overall length in yards but then each hole in meters, or vice versa. Summing up the lengths of the holes and then comparing to the total course length was my best workaround for this.
- Some of the figures from Golfpass were patently wrong. The par-5 11th hole in Derry, for example, is obviously longer than 311 yards. I didn’t try to correct for these.
- Courses have different tee-boxes: championship, society, ladies, and juniors, for example. I just took the championship figures, in general the longest distance tee-box for each hole.
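The meters-versus-yards check from the list above can be looked at in isolation. This is only a minimal sketch and not the code used for the post: the helper name `holes_in_meters` and the hole lengths are made up for illustration, and the advertised total is assumed to be in yards. The 1.05 threshold mirrors the one used in the function further down.

```r
# Hypothetical helper: TRUE when the advertised total is clearly larger than
# the summed hole lengths, which suggests the holes are listed in meters
# while the total is in yards (1 m is about 1.09 yd)
holes_in_meters <- function(hole_lengths, total_length, tolerance = 1.05) {
  total_length > sum(hole_lengths) * tolerance
}

yards_per_meter <- 1.09361

# Made-up card: nine holes of 300 each, so the holes sum to 2700
holes <- rep(300, 9)

# Total agrees with the summed holes: same units
holes_in_meters(holes, total_length = 2700)                    # FALSE

# Total is about 9% bigger than the summed holes: holes look like meters
holes_in_meters(holes, total_length = 2700 * yards_per_meter)  # TRUE
```

The check only looks for a total that is too big, so a card with the total in meters and the holes in yards is treated the same as one where the units agree.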
The function that works course-by-course is shown below.
course_card <- function(url) {
  # use the url to work out the golf club name
  course <- sub(".*courses/\\d+-", "", url) |>
    str_replace_all("-", " ") |>
    str_to_title()

  # get raw data from the url using rvest
  w <- read_html(url)
  pathway_data_html <- html_nodes(w, "td")
  card <- html_text(pathway_data_html)

  # get distance units (m or yds) for the total length, and the total length itself
  unit <- card[3] |> str_extract("[a-z]+")
  total_length <- card[3] |> str_extract("\\d+") |> as.numeric()

  # find the first cell that marks the start of each scorecard row
  length_start <- which(str_detect(card, ": \\d\\d\\.\\d"))[1]
  index_start <- which(str_detect(card, "Handicap$"))[1]
  par_start <- which(str_detect(card, "Par"))[1]

  # work out the lengths for each hole
  lengths <- card[(length_start + 1):(length_start + 20)]
  # work out the handicap indices for each hole
  indexs <- card[(index_start + 1):(index_start + 20)]
  # work out the pars (3, 4, or 5) for each hole
  par <- card[(par_start + 1):(par_start + 20)]

  # put all this together into a tibble, skipping the tenth cell of each row
  my_card <- tibble(course = rep(course, 18),
                    hole = 1:18,
                    par = c(par[1:9], par[11:19]) |> as.numeric(),
                    length = c(lengths[1:9], lengths[11:19]) |> as.numeric(),
                    index = c(indexs[1:9], indexs[11:19]) |> as.numeric())

  # correct for 9 hole courses
  my_card <- my_card |> drop_na()
  if (nrow(my_card) < 18) {
    my_card <- filter(my_card, hole <= 9)
  }
  hole_lengths <- sum(my_card$length)
  if (nrow(my_card) == 9) {
    hole_lengths <- 2 * hole_lengths
  }

  if (total_length > hole_lengths * 1.05) {
    # total length given in yards but holes in meters
    my_card <- my_card |>
      mutate(unit = "meters",
             yards = length * 1.09361)
  } else {
    # otherwise treat the hole lengths as yards
    my_card <- my_card |>
      mutate(unit = "yards",
             yards = length)
  }

  my_card |> select(course, hole, par, length, unit, index, yards)
}
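The `[1:9]` and `[11:19]` slicing in the function is easier to see on a toy vector. The sketch below assumes a flattened scorecard row laid out as label, front nine, a subtotal, back nine, another subtotal, and a grand total. That layout is inferred from the slicing itself rather than taken from Golfpass, and the par values are made up.

```r
# Hypothetical flattened row of table cells (layout assumed, values made up)
front <- c(4, 3, 5, 4, 4, 3, 4, 5, 4)
back  <- c(4, 4, 3, 5, 4, 4, 3, 4, 5)
card  <- as.character(c("Par", front, sum(front), back, sum(back),
                        sum(front, back)))

# Locate the row label, then take the 20 cells that follow it
start  <- which(grepl("Par", card))[1]
window <- card[(start + 1):(start + 20)]

# Cells 1-9 are holes 1-9, cell 10 is the front-nine subtotal,
# and cells 11-19 are holes 10-18
par <- as.numeric(c(window[1:9], window[11:19]))
par
```

Under this assumed layout the subtotal in the tenth cell is skipped, leaving 18 par values in hole order.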
The calls to `map` then create the overall data frame, `donegal_golf`.
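As a rough illustration of that pattern (not the author's exact code), the sketch below wraps a per-course function in `safely`, maps it over the URLs, maps again to pull out the results, and stacks them with `data.table::rbindlist`. The stand-in function `toy_card`, the example URLs, and the name `all_cards` are hypothetical, chosen so the block runs without touching the network.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for course_card(): errors when there is no scorecard
toy_card <- function(url) {
  if (grepl("no-scorecard", url)) stop("no scorecard found for ", url)
  tibble(course = basename(url), hole = 1:3, par = c(4, 3, 5))
}

# Made-up URLs for illustration only
urls <- c("https://example.com/courses/1-alpha",
          "https://example.com/courses/2-no-scorecard",
          "https://example.com/courses/3-gamma")

# safely() returns a function that captures errors instead of throwing them
safe_card <- safely(toy_card)

# First map: try every course; each element is list(result = , error = )
attempts <- map(urls, safe_card)

# Second map: keep only the results (NULL where the scrape failed)
cards <- map(attempts, "result")

# Drop the NULLs and stack the remaining cards into one table
all_cards <- data.table::rbindlist(compact(cards)) |> as_tibble()
all_cards
```

Keeping the `error` elements around, rather than discarding them, makes it easy to list which courses failed and why.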
Finally, we’ll do a little cleaning. We know these are all golf courses, so we can skip that part of the name. Also, for nine hole courses, players tend to go around twice, so we’ll double up their holes and adjust the hole indices accordingly.
donegal_golf <- donegal_golf |>
  mutate(course = course |>
           str_remove("Golf") |>
           str_remove("Links") |>
           str_remove("Club") |>
           str_remove("And") |>
           str_remove("Hotel") |>
           str_remove("International") |>
           str_squish())

course_summary <- donegal_golf |>
  group_by(course) |>
  summarise(total_length = sum(yards),
            total_par = sum(par |> as.character() |> as.numeric()),
            highlight = ifelse(n() < 10, "nine", "eighteen")) |>
  ungroup()

course_summary <- course_summary |>
  mutate(highlight = ifelse(course == "Dunfanaghy", "yes", highlight))

nines <- course_summary |>
  filter(highlight == "nine")

nine_cards <- donegal_golf |>
  filter(course %in% nines$course)

nine_type <- nine_cards |>
  group_by(course) |>
  summarise(index_type = sum(index)) |>
  ungroup()

nine_cards <- nine_cards |>
  left_join(nine_type) |>
  mutate(index = case_when(index_type == 81 ~ index + 1,
                           index_type == 90 ~ index - 1,
                           index_type == 45 ~ index * 2),
         hole = hole + 9) |>
  select(-index_type)

otway_style_indices <- nine_type |> filter(index_type == 45) |> pull(course)

donegal_golf <- donegal_golf |>
  mutate(index = ifelse(course %in% otway_style_indices, index * 2 - 1, index)) |>
  bind_rows(nine_cards)
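The magic numbers 81, 90, and 45 in the code above are the sums of the handicap indices on a nine-hole card: 81 when the card uses the odd indices 1 to 17, 90 when it uses the even indices 2 to 18, and 45 when it simply uses 1 to 9. A small worked sketch of that arithmetic follows; the function name `second_nine_index` and the example cards are made up for illustration.

```r
# Hypothetical helper: indices for the second lap of a nine-hole course,
# chosen from the sum of the indices printed on the card
second_nine_index <- function(index) {
  total <- sum(index)
  if (total == 81) {
    index + 1          # card uses odd indices, second lap gets the evens
  } else if (total == 90) {
    index - 1          # card uses even indices, second lap gets the odds
  } else if (total == 45) {
    index * 2          # card uses 1 to 9, second lap gets 2, 4, ..., 18
  } else {
    rep(NA_real_, length(index))
  }
}

odd_card    <- seq(1, 17, by = 2)   # sums to 81
even_card   <- seq(2, 18, by = 2)   # sums to 90
one_to_nine <- 1:9                  # sums to 45

second_nine_index(odd_card)
second_nine_index(even_card)

# For a 1-to-9 card the first lap is also rescaled, to 1, 3, ..., 17
first_lap  <- one_to_nine * 2 - 1
second_lap <- second_nine_index(one_to_nine)
sort(c(first_lap, second_lap))
```

In each case the two laps together use every index from 1 to 18 exactly once, which is the same adjustment the pipeline above applies to `donegal_golf`.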
This gives a tibble with one row per hole and columns for course, hole, par, length, unit, index, and yards.
Now that we have our data the way we want, we can look and see what it tells us. That’s for the next post.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/eugene100hickey/fizzics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Eugene (2022, Aug. 3). Euge: Golf Courses in Donegal - Part I, data access. Retrieved from https://www.fizzics.ie/posts/2022-08-03-golf-courses-in-donegal/
BibTeX citation
@misc{eugene2022golf,
  author = {Eugene, },
  title = {Euge: Golf Courses in Donegal - Part I, data access},
  url = {https://www.fizzics.ie/posts/2022-08-03-golf-courses-in-donegal/},
  year = {2022}
}