Data wrangling tricks from the R4DS slack

[This article was first published on R | Discindo, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Every now and then there is a question on the R4DS
slack that produces a large number of
replies. This case was particularly interesting because the user that
had the issue deleted all the original content, the question, the
clarifications and so on.

I thought it would be interesting to document this, because it may be
useful to other people.

The question was posted with regards to some genes data, and the task
was to filter out some of the genes that have missing values. But
generally speaking the question can be summarized like this:

How to filter a dataset in a such way that all observations from a group are
dropped if the group doesn’t meet certain criteria?

These are our sample data:

dat <- data.frame(
groups = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
features = c("c", "d", "e", "e", "e", "d", "d", "f", "g"),
col1 = c(1, 2, 5, NA, 5, NA, 6, 7, NA),
col2 = c(1, 2, 4, 4, NA, 5, 6, NA, 7),
col3 = c(1, 3, 4, NA, 5, NA, 4, 7, 8)
)
dat
## groups features col1 col2 col3
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4
## 4 b e NA 4 NA
## 5 b e 5 NA 5
## 6 b d NA 5 NA
## 7 c d 6 6 4
## 8 c f 7 NA 7
## 9 c g NA 7 8

We want to keep only group a since it is the only one where none of the
observations have missing data.

There is a case of complete observations in group c as well, but it is only
one (on row 7), so we don’t want to keep that group.

A simple pipeline to do this would be:

library(dplyr)
library(tidyr)
dat |>
drop_na() |>
group_by(groups) |>
filter(n() == 3)
## # A tibble: 3 × 5
## # Groups: groups [1]
## groups features col1 col2 col3
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4

After providing this solution, I thought about the different scenarios where a similar recipe may be needed. For example, what if the data is too big and, data.table would be preferred? Or what if the pipeline should be part of a bigger project using python for example?

More R

So here are the possible solutions with other libraries. First data.table.

library(data.table)
dat_dt <- setDT(dat)
dat_grp <-
dat_dt[complete.cases(dat_dt),][, .N, by = groups][N == 3]
dt_result <- dat_dt[dat_grp, on = "groups"]
dt_result
## groups features col1 col2 col3 N
## 1: a c 1 1 1 3
## 2: a d 2 2 3 3
## 3: a e 5 4 4 3

Most of the time I find data.table unreadable, and luckily, a tidytable solution is also possible.

dat |>
tidytable::drop_na() |>
tidytable::group_by(groups) |>
tidytable::filter(dplyr::n() == 3)
## # A tidytable: 3 × 5
## # Groups: groups
## groups features col1 col2 col3
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4

Python

On to python. Here is a simple chain in pandas that does the same.

import pandas as pd
dat_py = r.dat
(dat_py.dropna(axis = 0, how = "any")
.groupby("groups")
.filter(lambda x: len(x) == 3))
## groups features col1 col2 col3
## 0 a c 1.0 1.0 1.0
## 1 a d 2.0 2.0 3.0
## 2 a e 5.0 4.0 4.0

The interesting bit is that drop_na() and dropna() behave
differently. In
tidyr, the
function assumes that it is checking all of the columns by default,
while in
pandas
arguments have to be called to check if any of the columns have missing
values.

SQL

And, how about some SQL? It is possible, of course, that your data may be in a database and not in a flat file. Here is a sqlite solution.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "dat", dat)
query <- "SELECT t1.* FROM dat AS t1
INNER JOIN (
SELECT groups, COUNT(*) as cnt
FROM dat
WHERE col1 IS NOT NULL AND col2 IS NOT NUll AND col3 IS NOT NULL
GROUP BY groups
) AS t2
ON t1.groups = t2.groups
WHERE t2.cnt = 3;"
res <- dbSendQuery(con, query)
dbFetch(res)
## groups features col1 col2 col3
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4

This was a quick post that provides answers with different tools to the question: How to filter a dataset in a such way that all observations from a group are dropped if the group doesn’t meet certain criteria?

We saw how to do it in tidyverse and in data.table. What is missing is base R. Sorry about that! However there are solutions in pandas and SQL. Hopefully something can be useful to somebody.

Data wrangling tricks from the R4DS slack

More R

Python

SQL

Related

Related

Leave a Reply Cancel reply