I have continued playing with the tidyverse for different parts of a couple of projects.
Often I need to apply a function by groups of observations; sometimes, that function returns more than a single number. For example, for each group we could fit a distribution and return its parameters. Or, simpler for the purposes of this exploration, calculate and return a bunch of numbers.
    describe_c <- function(x) {
      mn <- mean(x, na.rm = TRUE)
      dev <- sd(x, na.rm = TRUE)
      n <- sum(!is.na(x))
      cv <- dev/mn*100
      return(c(mean = mn, sdev = dev, count = n, coefvar = cv))
    }
If I have a data frame called field_data, with family codes (trees with the same parents, codes have been changed to protect the innocent) and stem diameters (in mm), I could do the following in base R:
    # This line produces an annoying list
    summary_one <- with(field_data, tapply(stem, family, FUN = describe_c))

    # This puts together a matrix by joining
    # the list results using rbind()
    summary_one <- do.call(rbind, summary_one)

    # To continue processing it might be better to convert
    # to a data frame
    summary_one <- data.frame(summary_one)
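For anyone without the field data at hand, here is a minimal self-contained sketch of the same base R steps. The data frame `toy_data` and its values are simulated stand-ins (an assumption for illustration, not the real trial data), and the family codes are carried over from the row names into a regular column at the end.

```r
# Simulated stand-in for field_data (hypothetical values, not the real data)
set.seed(42)
toy_data <- data.frame(
  family = rep(c("A", "B", "C"), each = 5),
  stem   = round(rnorm(15, mean = 190, sd = 25))
)

# Same toy summary: returns a named numeric vector with four values
describe_c <- function(x) {
  mn  <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n   <- sum(!is.na(x))
  cv  <- dev / mn * 100
  c(mean = mn, sdev = dev, count = n, coefvar = cv)
}

# tapply() gives one named vector per family, stored as a list
toy_list <- with(toy_data, tapply(stem, family, FUN = describe_c))

# Stack the vectors into a matrix (one row per family), then convert
toy_summary <- data.frame(do.call(rbind, toy_list))

# Keep the family code as a regular column for later merging
toy_summary$family <- rownames(toy_summary)
print(toy_summary)
```

Having the family code as a column rather than as row names makes it easier to merge the summaries of several variables later.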
And if I need to do this for several variables, I will need to merge each of these matrices in a data frame.
Continuing with my experimentation with the tidyverse, I was wondering how to get the above going with dplyr et al. After failing a few times I asked the question on Twitter and got a number of helpful replies.
One of the keys is that dplyr can store a list result from a function. Modifying my toy function is pretty straightforward, and now looks like:
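    describe_list <- function(x) {
      mn <- mean(x, na.rm = TRUE)
      dev <- sd(x, na.rm = TRUE)
      n <- sum(!is.na(x))
      cv <- dev/mn*100
      return(list(c(mean = mn, sdev = dev, count = n, coefvar = cv)))
    }

    # Apply by family and store each result in a list column
    field_data %>%
      group_by(family) %>%
      summarise(model = describe_list(stem)) -> summary_two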
And we can check the contents of summary_two to see we have a list in which each element contains 4 values:
    head(summary_two)
    # A tibble: 6 x 2
    #   family     model
    # 1 A      <dbl [4]>
    # 2 B      <dbl [4]>
    # 3 C      <dbl [4]>
    # 4 D      <dbl [4]>
    # 5 E      <dbl [4]>
    # 6 F      <dbl [4]>
We still need to extract the values from each element of the list and assign them to a variable name. Using map_dbl() from the purrr package is pretty straightforward in this case, and we can extract the values either using their names or their position in the element.
    summary_two %>%
      mutate(mn = map_dbl(model, 'mean'),
             sd = map_dbl(model, 'sdev'),
             n = map_dbl(model, 'count'),
             cv = map_dbl(model, 4)) %>%
      head
    # A tibble: 6 x 6
    #   family     model       mn       sd     n       cv
    # 1 A      <dbl [4]> 190.8306 23.71290   425 12.42615
    # 2 B      <dbl [4]> 190.1111 25.46554   396 13.39508
    # 3 C      <dbl [4]> 188.2646 27.39215   461 14.54981
    # 4 D      <dbl [4]> 189.2668 25.16330   431 13.29514
    # 5 E      <dbl [4]> 183.5238 19.70182    21 10.73530
    # 6 F      <dbl [4]> 183.1250 28.82377    24 15.73994
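A possible shortcut, if a reasonably recent tidyr (1.0.0 or later) is installed, is `unnest_wider()`, which spreads a list column of named vectors into one column per name. This is a sketch under that assumption, not one of the approaches suggested in the replies; `toy_data` is simulated and purely hypothetical.

```r
library(dplyr)
library(tidyr)

# Simulated stand-in for field_data (hypothetical values)
set.seed(42)
toy_data <- tibble(
  family = rep(c("A", "B", "C"), each = 5),
  stem   = round(rnorm(15, mean = 190, sd = 25))
)

# Same idea as describe_list: wrap the named vector in a list
describe_list <- function(x) {
  mn  <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n   <- sum(!is.na(x))
  cv  <- dev / mn * 100
  list(c(mean = mn, sdev = dev, count = n, coefvar = cv))
}

# Summarise into a list column, then spread it by element name
toy_wide <- toy_data %>%
  group_by(family) %>%
  summarise(model = describe_list(stem)) %>%
  unnest_wider(model)

print(toy_wide)
```

The new columns take the names stored in the vector (mean, sdev, count, coefvar), so there is no need to list them one by one as with `map_dbl()`.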
I'm still playing with ideas to be lazier at extraction time. An almost abhorrent idea is to return the output as a character string and convert the types later, as in:
    describe_char <- function(x) {
      mn <- mean(x, na.rm = TRUE)
      dev <- sd(x, na.rm = TRUE)
      n <- sum(!is.na(x))
      cv <- dev/mn*100
      return(paste(mn, dev, n, cv, sep = ':'))
    }

    field_data %>%
      group_by(family) %>%
      summarise(model = describe_char(stem)) -> summary_three

    head(summary_three)
    # A tibble: 6 x 2
    #   family model
    # 1 A      190.830588235294:23.7128956613006:425:12.4261502731746
    # 2 B      190.111111111111:25.4655444116168:396:13.3950847284951
    # 3 C      188.26464208243:27.3921487349435:461:14.5498105390125
    # 4 D      189.266821345708:25.1632953227626:431:13.2951434085746
    # 5 E      183.52380952381:19.7018249094317:21:10.7352963959021
    # 6 F      183.125:28.8237711378767:24:15.7399432834822

    summary_three %>%
      separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>%
      head
    # A tibble: 6 x 5
    #   family mn               sd               n     cv
    # 1 A      190.830588235294 23.7128956613006 425   12.4261502731746
    # 2 B      190.111111111111 25.4655444116168 396   13.3950847284951
    # 3 C      188.26464208243  27.3921487349435 461   14.5498105390125
    # 4 D      189.266821345708 25.1632953227626 431   13.2951434085746
    # 5 E      183.52380952381  19.7018249094317 21    10.7352963959021
    # 6 F      183.125          28.8237711378767 24    15.7399432834822
And we can get all the way there with:
    summary_three %>%
      separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>%
      mutate_at(c('mn', 'sd', 'n', 'cv'), as.numeric) %>%
      head
    # A tibble: 6 x 5
    #   family       mn       sd     n       cv
    # 1 A      190.8306 23.71290   425 12.42615
    # 2 B      190.1111 25.46554   396 13.39508
    # 3 C      188.2646 27.39215   461 14.54981
    # 4 D      189.2668 25.16330   431 13.29514
    # 5 E      183.5238 19.70182    21 10.73530
    # 6 F      183.1250 28.82377    24 15.73994
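As a possible variation, `separate()` has a `convert` argument that runs type conversion on the new columns, which would fold the last two steps into one. The sketch below uses a tiny hand-made data frame (`toy_chars`, with hypothetical values) rather than the real summary.

```r
library(tidyr)

# Hand-made stand-in for summary_three (hypothetical values)
toy_chars <- data.frame(
  family = c("A", "B"),
  model  = c("190.5:23.7:425:12.4", "183.1:28.8:24:15.7"),
  stringsAsFactors = FALSE
)

# convert = TRUE asks separate() to guess the type of each new column
toy_numbers <- separate(toy_chars, model,
                        c("mn", "sd", "n", "cv"),
                        sep = ":", convert = TRUE)

str(toy_numbers)
```

Note that the conversion guesses types column by column, so a whole-number column such as `n` comes back as integer rather than double.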
Which I assume has all sorts of potential negative side-effects, but looks really cool.
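One side-effect that is easy to check is loss of precision: `paste()` writes doubles with at most 15 significant digits, so a value may not survive the trip through character and back exactly. A small base R sketch, using an arbitrary value chosen only for illustration:

```r
# An arbitrary double that needs more than 15 significant digits
x <- 1 / 3

# Round trip through character, as describe_char() and as.numeric() would do
x_back <- as.numeric(paste(x))

# The difference is tiny, but the round trip is not guaranteed to be exact
round_trip_error <- abs(x - x_back)
print(round_trip_error)
print(isTRUE(all.equal(x, x_back)))
```

For summaries like means and standard deviations an error of this size is unlikely to matter, but it is worth knowing about before comparing values for exact equality.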
In case you want to play with the problem, here is a tiny example of field data.