[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers].
I’ve been busy with the field exams, so I haven’t had much time to work on the blog.
The spuriouscorrelations package started as a fun project for one of my tutorials.
Here is an interesting case: the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in.
library(spuriouscorrelations)
library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':
    filter, lag

The following objects are masked from 'package:base':
    intersect, setdiff, setequal, union
library(ggplot2)

unique(spurious_correlations$var1)

 [1] Suicides by hanging, strangulation and suffocation
 [2] Number of people who drowned by falling into a pool
 [3] Number of people who died by becoming tangled in their bedsheets
 [4] Murders by steam, hot vapours and hot objects
 [5] Computer science doctorates awarded in the US
 [6] Sociology doctorates awarded in the US
 [7] Civil engineering doctorates awarded in the US
 [8] People who drowned after falling out of a fishing boat
 [9] Drivers killed in collision with railway train
[10] Total US crude oil imports
[11] Number of people who drowned while in a swimming-pool
[12] Suicides by crashing of motor vehicle
[13] Number of people killed by venomous spiders
[14] Mathematics doctorates awarded
14 Levels: Civil engineering doctorates awarded in the US ...
drownings <- spurious_correlations %>%
  filter(
    var1 == "Number of people who drowned by falling into a pool"
  ) %>%
  select(year, var1, var2, var1_value, var2_value)

cor(drownings$var1_value, drownings$var2_value)
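By default, cor() returns Pearson's correlation coefficient. To get a feel for how easily short, unrelated series can produce a large coefficient, here is a minimal sketch using simulated data rather than the package data. The seed, the series length, the number of simulations, and the helper name random_walk_cor() are all assumptions made for illustration.

```r
# Illustrative simulation only: none of these values come from the package.
set.seed(42)      # arbitrary seed, used only for reproducibility
n_years <- 11     # hypothetical length of each yearly series
n_sims  <- 2000   # hypothetical number of simulated pairs

# Pearson correlation between two independent random walks of length n.
# The two walks share no common cause by construction.
random_walk_cor <- function(n) {
  x <- cumsum(rnorm(n))
  y <- cumsum(rnorm(n))
  cor(x, y)
}

sim_cors <- replicate(n_sims, random_walk_cor(n_years))

# Share of simulated pairs with |r| > 0.5 despite being unrelated
mean(abs(sim_cors) > 0.5)
```

A sizeable share of the simulated pairs cross that threshold, which is why a high coefficient on a handful of yearly observations says little about any real relationship.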
Now let’s plot the data.
# compute a scale factor so that max(var2_value * factor) ≈ max(var1_value)
max1 <- max(drownings$var1_value)
max2 <- max(drownings$var2_value)
ratio <- max1 / max2

ggplot(drownings, aes(x = year)) +
  geom_line(aes(y = var1_value, color = "Drownings")) +
  geom_line(aes(y = var2_value * ratio, color = "Films")) +
  scale_y_continuous(
    name = "Number of drownings",
    sec.axis = sec_axis(~ . / ratio, name = "Number of films"),
    limits = c(0, NA)
  ) +
  scale_color_manual(
    name = "",
    values = c(
      "Drownings" = "blue",
      "Films" = "red"
    )
  ) +
  theme_minimal() +
  labs(
    title = "Number of people who drowned by falling into a pool vs.\nNumber of films Nicolas Cage appeared in",
    caption = "Source: Spurious Correlations (Vigen 2015)"
  )
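The dual-axis trick relies on two transformations that must be exact inverses: the second series is multiplied by ratio before it is drawn, and sec_axis() divides by ratio to label the right-hand axis in the original units. Here is a small base R sketch of that round trip, using made-up numbers (the vectors primary and secondary are hypothetical, not taken from the dataset).

```r
# Hypothetical values, chosen only to illustrate the rescaling
primary   <- c(10, 20, 40, 30)  # series drawn against the left axis
secondary <- c(1, 4, 2, 3)      # series drawn against the right axis

# Same scale factor as in the plot code: ratio of the two maxima
ratio <- max(primary) / max(secondary)

# Forward transform: the values geom_line() actually draws
scaled <- secondary * ratio

# Inverse transform: what sec_axis(~ . / ratio) applies to the axis breaks
recovered <- scaled / ratio

# The rescaled series peaks at the same height as the primary series,
# and the inverse transform recovers the original units
all.equal(max(scaled), max(primary))
all.equal(recovered, secondary)
```

If the two transformations were not inverses of each other, the right-hand axis labels would no longer correspond to the line drawn for the second series.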
Interested? You can install the package from GitHub:
pak::pkg_install("pachadotdev/spuriouscorrelations")