Evolving notes, images and sounds by Luis Apiolaza

Category: rblogs (Page 1 of 22)

Some love for Base R. Part 4

Following on parts 1, 2 & 3—yes, a series—we arrive to part 4 revisiting Base R. See part 1 for the rationale, in case you’re wondering Whyyyy?

A typical question going back to Base from the tidyverse: How do I join datasets? What do I use instead of bind_rows() and bind_cols()? Easy, rbind() and cbind(), yes, r for rows and c for cols, because base is concise.

By rows

If we have a couple of data frames with the same variables (columns), then using rbind() binds/glues/stitches the data frames one after the other.

example_df1 <- data.frame(record = 1:24,
                          treatment = rep(LETTERS[1:3], each = 8))

example_df2 <- data.frame(record = 25:48,
                          treatment = rep(LETTERS[4:6], each = 8))

example_df3 <- data.frame(record = 49:72)

# This one works
example_bound <- rbind(example_df1, example_df2)

# This one doesn't as they don't have the same variables
example_bound <- rbind(example_df1, example_df3)

# If we redefine the data frame we can join more than two data frames
example_df3 <- data.frame(record = 49:72,
                          treatment = rep(LETTERS[7:9], each = 8))

example_bound <- rbind(example_df1, example_df2, example_df3)

Of course we can use pipes too:

example_df1 |> rbind(example_df2) -> example_bound2

By columns

If we have a couple of data frames with the same number of rows (cases), then using cbind() binds/glues/stitches the data frames side by side.

example_df4 <- data.frame(record = 1:24,
                          treat1 = rep(LETTERS[1:3], each = 8))

example_df5 <- data.frame(treat2 = rep(LETTERS[4:5], 12),
                          meas = rnorm(24))

example_cbound <- cbind(example_df4, example_df5)

   record treat1 treat2       meas
1       1      A      D -2.1158479
2       2      A      E  0.7784022
3       3      A      D -0.0112054
4       4      A      E -0.1986594

When you are working with data frames you get pretty much what you’d expect in dplyr. However, if you are not working with data frames but, instead, you’re dealing with vectors you end up with matrices, in which all elements have the same type. Coercing different types may produce unexpected results

# Binding columns
x <- 1:26
y <- sqrt(x)

example_1 <- cbind(x, y)

# What do we get?
[1] TRUE

       x        y
 [1,]  1 1.000000
 [2,]  2 1.414214
 [3,]  3 1.732051
 [4,]  4 2.000000

# Perhaps unexpected result. Variable x
# was coerced to character
example_2 <- cbind(x, letters)
      x    letters
 [1,] "1"  "a"    
 [2,] "2"  "b"    
 [3,] "3"  "c"    
 [4,] "4"  "d"  

By one or more indices

When you have data frames with one or more variables “in common” the function to use is merge(), which may work like left_join() and right_join() in dplyr.

merge(x, y, by =)
# which you can read as
merge(left, right, by = )

Think of x as left and y as right. Using all.x = TRUE extra rows will be added to the output, one for each row in x that has no matching row in y. Using all.y = TRUE extra rows will be added to the output, one for each row in y that has no matching row in x.

As an example, I have two data frames with a tree id (ids) and a derived variable (first tree ring to achieve a technical threshold for microfibril angle and modulus of elasticity). I would like to join them by ids:

    ids assess
1 DM001      3
2 DM002      5
3 DM003      4
4 DM004      6
5 DM005      5
6 DM006      7

    ids ring
1 DM001    8
2 DM002    8
3 DM003    8
4 DM004    8
5 DM005    9
6 DM006   12

# Merging keeping all observations
gendata <- merge(firstmfa, firstmoe, by = 'ids', all = TRUE)

Another example using more than one joining variable. Actual wood density (in kg/m3) and microfibril angle (in degrees) assessments per tree ring, joined by tree code and ring number

> head(densdataT)
    ids ring density
1 DM001    1      NA
2 DM001    2      NA
3 DM001    3  327.96
4 DM001    4  325.37
5 DM001    5  336.59
6 DM001    6  360.82

> head(mfadataT)
    ids ring   mfa
1 DM001    1    NA
2 DM001    2    NA
3 DM001    3 31.93
4 DM001    4 31.70
5 DM001    5 33.21
6 DM001    6 27.98

assess <- merge(densdataT, mfadataT, by = c('tree', 'ring'), all = TRUE)

Anyone using other than RStudio?

I asked both in Mastodon and Twitter “Anyone using other than #RStudio as their main #rstats IDE?” and—knowing that some programmers are literal and would probably reply ‘Yes’—I added “What is it?”

Of course I got a few replies like “I only have used RStudio” (Why reply?) or “I use RStudio but in docker containers” (Still RStudio). I also received mostly helpful answers, with some of the usual suspects and a more esoteric option:

  • The most popular alternative was Visual Studio Code using the R Extension for Visual Studio Code plugin together with the languageserver R package.
  • Neovim together with the Nvim-R plugin.
  • Emacs (or one of its variants) + ESS (Emacs Speaks Statistics).

On the esoteric side Spacemacs (Emacs + layers of configuration), or an unholy combination of Emacs with Vim keybindings.

Any of these options will let you use Rmarkdown or Quarto if you are into that.

P.S. I focused on crossplatform options (I tend to move a lot), but a comment in Mastodon mentioned a Windows-only option that could be useful for a few people: Notepad++ together with the NppToR utility.

Creating an n x n autocorrelation matrix

Between covid-19 news and announcements of imminent Russia-Ukraine wars I needed a bit of a distraction. Sooo, here it is how to create an n x n autocorrelation matrix based on a correlation rho, with a simple 5 x 5 example in R:

n <- 5
rho <- 0.9
mat <- diag(n)
mat <- rho^abs(row(mat) - col(mat))

This produces the following output:

       [,1]  [,2] [,3]  [,4]   [,5]
[1,] 1.0000 0.900 0.81 0.729 0.6561
[2,] 0.9000 1.000 0.90 0.810 0.7290
[3,] 0.8100 0.900 1.00 0.900 0.8100
[4,] 0.7290 0.810 0.90 1.000 0.9000
[5,] 0.6561 0.729 0.81 0.900 1.0000

How does this work? Starting from defining n as 5, and the basic rho correlation as 0.9, then mat <- diag(n) creates a diagonal matrix—a square matrix with ones in the diagonal and zeros elsewhere—with 5 rows and 5 columns.

Continue reading

The beauty of code vectorisation

I came across this problem in Twitter:

Description of the problem by Fermat’s Library.

The basic pieces of the problem are:

  • We need to generate pseudorandom numbers with an identical distribution, add them up until they go over 1 and report back how many numbers we needed to achieve that condition.
  • Well, do the above “quite a few times” and take the average, which should converge to the number e (2.7182…).
Continue reading

Recreational programming

I think programming, aka coding, is a fun activity. We are solving a problem subject to a set of constraints that can be time, memory, quantity of code, language, etc. Besides being a part of my work, coding is also a good distraction when doing it for the sake of it.

In this type of situation I like to set myself additional constraints. For example, I try to use only base R or a very minimal set of features. Moreover, I prefer vectorised code: there are no explicit loops in the code. For toy problems this restriction doesn’t make much of a difference in execution time, but in real problems R (and Matlab) run much more slowly with non-vectorised code.

Continue reading
« Older posts

© 2024 Palimpsest

Theme by Anders NorenUp ↑