Evolving notes, images and sounds by Luis Apiolaza

Category: r (Page 1 of 20)

Some love for Base R. Part 4

Following on parts 1, 2 & 3—yes, a series—we arrive to part 4 revisiting Base R. See part 1 for the rationale, in case you’re wondering Whyyyy?

A typical question going back to Base from the tidyverse: How do I join datasets? What do I use instead of bind_rows() and bind_cols()? Easy, rbind() and cbind(), yes, r for rows and c for cols, because base is concise.

By rows

If we have a couple of data frames with the same variables (columns), then using rbind() binds/glues/stitches the data frames one after the other.

example_df1 <- data.frame(record = 1:24,
                          treatment = rep(LETTERS[1:3], each = 8))

example_df2 <- data.frame(record = 25:48,
                          treatment = rep(LETTERS[4:6], each = 8))

example_df3 <- data.frame(record = 49:72)

# This one works
example_bound <- rbind(example_df1, example_df2)

# This one doesn't as they don't have the same variables
example_bound <- rbind(example_df1, example_df3)

# If we redefine the data frame we can join more than two data frames
example_df3 <- data.frame(record = 49:72,
                          treatment = rep(LETTERS[7:9], each = 8))

example_bound <- rbind(example_df1, example_df2, example_df3)

Of course we can use pipes too:

example_df1 |> rbind(example_df2) -> example_bound2

By columns

If we have a couple of data frames with the same number of rows (cases), then using cbind() binds/glues/stitches the data frames side by side.

example_df4 <- data.frame(record = 1:24,
                          treat1 = rep(LETTERS[1:3], each = 8))

example_df5 <- data.frame(treat2 = rep(LETTERS[4:5], 12),
                          meas = rnorm(24))

example_cbound <- cbind(example_df4, example_df5)

   record treat1 treat2       meas
1       1      A      D -2.1158479
2       2      A      E  0.7784022
3       3      A      D -0.0112054
4       4      A      E -0.1986594

When you are working with data frames you get pretty much what you’d expect in dplyr. However, if you are not working with data frames but, instead, you’re dealing with vectors you end up with matrices, in which all elements have the same type. Coercing different types may produce unexpected results

# Binding columns
x <- 1:26
y <- sqrt(x)

example_1 <- cbind(x, y)

# What do we get?
[1] TRUE

       x        y
 [1,]  1 1.000000
 [2,]  2 1.414214
 [3,]  3 1.732051
 [4,]  4 2.000000

# Perhaps unexpected result. Variable x
# was coerced to character
example_2 <- cbind(x, letters)
      x    letters
 [1,] "1"  "a"    
 [2,] "2"  "b"    
 [3,] "3"  "c"    
 [4,] "4"  "d"  

By one or more indices

When you have data frames with one or more variables “in common” the function to use is merge(), which may work like left_join() and right_join() in dplyr.

merge(x, y, by =)
# which you can read as
merge(left, right, by = )

Think of x as left and y as right. Using all.x = TRUE extra rows will be added to the output, one for each row in x that has no matching row in y. Using all.y = TRUE extra rows will be added to the output, one for each row in y that has no matching row in x.

As an example, I have two data frames with a tree id (ids) and a derived variable (first tree ring to achieve a technical threshold for microfibril angle and modulus of elasticity). I would like to join them by ids:

    ids assess
1 DM001      3
2 DM002      5
3 DM003      4
4 DM004      6
5 DM005      5
6 DM006      7

    ids ring
1 DM001    8
2 DM002    8
3 DM003    8
4 DM004    8
5 DM005    9
6 DM006   12

# Merging keeping all observations
gendata <- merge(firstmfa, firstmoe, by = 'ids', all = TRUE)

Another example using more than one joining variable. Actual wood density (in kg/m3) and microfibril angle (in degrees) assessments per tree ring, joined by tree code and ring number

> head(densdataT)
    ids ring density
1 DM001    1      NA
2 DM001    2      NA
3 DM001    3  327.96
4 DM001    4  325.37
5 DM001    5  336.59
6 DM001    6  360.82

> head(mfadataT)
    ids ring   mfa
1 DM001    1    NA
2 DM001    2    NA
3 DM001    3 31.93
4 DM001    4 31.70
5 DM001    5 33.21
6 DM001    6 27.98

assess <- merge(densdataT, mfadataT, by = c('tree', 'ring'), all = TRUE)

Anyone using other than RStudio?

I asked both in Mastodon and Twitter “Anyone using other than #RStudio as their main #rstats IDE?” and—knowing that some programmers are literal and would probably reply ‘Yes’—I added “What is it?”

Of course I got a few replies like “I only have used RStudio” (Why reply?) or “I use RStudio but in docker containers” (Still RStudio). I also received mostly helpful answers, with some of the usual suspects and a more esoteric option:

  • The most popular alternative was Visual Studio Code using the R Extension for Visual Studio Code plugin together with the languageserver R package.
  • Neovim together with the Nvim-R plugin.
  • Emacs (or one of its variants) + ESS (Emacs Speaks Statistics).

On the esoteric side Spacemacs (Emacs + layers of configuration), or an unholy combination of Emacs with Vim keybindings.

Any of these options will let you use Rmarkdown or Quarto if you are into that.

P.S. I focused on crossplatform options (I tend to move a lot), but a comment in Mastodon mentioned a Windows-only option that could be useful for a few people: Notepad++ together with the NppToR utility.

Some love for Base R. Part 3

It seems a few people have found useful the reminders of base-R functionality covered in “Some love for Base R” Part 1 and Part 2. So I will keep on mentioning a few bits and pieces that you may find handy when going back to Base or even visiting it for the first time.

A reminder: the fictional setting is that you are revisiting legacy code or developing new code under strong constraints: minimal use of packages. The latter could be because you are using webR or you’re keen on having few dependencies. I am assuming R 4.1 when mentioning native pipes, but not the existence of the _ placeholder yet.

In this post I play a little with variable names. None of this would be “production code”, but it would work fine in your analyses. I have similar code (except without pipes) that is almost 20 years old and still running.

A rose by any other name would smell as sweet

William Shakespeare

Changing variable names, renaming, does not work quite like in the tidyverse, in which it can be one more step in a list of transformations with rename(). In base R we rely on the names() function, which is used for both for listing the names of an object and changing them. Usually people would either change all names, providing a vector with names, or replacing one or more names by referring to their position, as in:

#[1] "breaks"  "wool"    "tension"

names(warpbreaks) <- c("bre", "woo", "ten")
#[1] "bre" "woo" "ten"

names(warpbreaks)[3] <- "tension"
#[1] "bre"     "woo"     "tension"

names(warpbreaks)[1:2] <- c("breaks", "wool")
#[1] "breaks"  "wool"    "tension"

One problem is relying on the position of the variable, which may change with different datasets. One option—although a bit wordy—is to use a regular expression to rename a specific variable with the base sub() function:

# General use
# names(data_frame) <- sub("old_name", "new_name", names(data_frame))

names(warpbreaks) <- sub("wool", "woolly", names(warpbreaks))
#[1] "breaks"  "woolly"  "tension"

Inside sub() we get a list of all the names for the data frame, look for the one that matches “wool” and replace it by “woolly”.

Cleaning names

A typical problem when receiving datasets is that the authors followed weird naming conventions or, more likely, no convention at all. There are shouting ALLCAPS, names separated by dots, or by spaces, or whatever. I usually work with lowercase names separated by underscores if more than one word. The easiest way to convert names is using janitor‘s clean_names() function.

Our data set’s names could look like this:

#[1] "BLOCK_NO"  "TREE_NO"   "FAM_CODE"  "age core"  "site.code"

I could write a bare bones clean names function in base (covering most of my cleaning needs) using the following code:

names(data_frame) |> tolower() |> {\(x) gsub("[\\. ]", "_", x)}()

For a data frame:

1. get names for the data frame names(data_frame)

2. pass them to the next function |> (using native pipe)

3. convert names to lowercase tolower()

4. replace dots and spaces with underscore using regular expression gsub()

The Klingonian part is a lambda function wrapped by { }() so it works with the native pipe and it is the same as {function(x) gsub("[\\. ]", "_", x)}(). One could perfectly write the code without pipes and using less Klingon.

Applying the function we’d get:

names(bos) |> tolower() |> {\(x) gsub("[\\. ]", "_", x)}()
#[1] "block_no"  "tree_no"   "fam_code"  "age_core"  "site_code"

Of course you’d need to assign the names, so they overwrite the existing ones. Easiest way would be to add -> names(bos) at the end of the line. A right-side assign (wink).

names(bos) |> tolower() |> {\(x) gsub("[\\. ]", "_", x)}() -> names(bos)

Some love for Base R. Part 2

Where were we? Giving some love to base-R and putting together the idea that it is possible to write R very clearly when using base. Two sets of typical issues:

Subsetting rows and columns

When running analyses we often want to work on a subset of all cases (rows) or variables (columns). People are used to filter() (for rows) and select() (for columns) in the tidyverse but then search how to do that in base and get ugly responses. For example, if we had a number of trials in a data frame called all_trials and we wanted to keep only a single one located in Christchurch we could try using sort of matrix notation, keeping the rows that meet the criterion, and all variables, as we don’t specify criteria for them:

my_trial <– all_trials[all_trials$location == "Christchurch", ]

# better, by using with(). More below
my_trial <- with(all_trials, all_trials[location == "Christchurch", ])

You could have been tempted to use all_trials[location == "Christchurch", ] by itself, but R wouldn't have known to look for location inside all_trials. Much clearer, though, would have been to use the subset() function from base R, which does the job of both filter() and select() in the tidyverse. It works like this:

subset(data_frame, conditions_for_rows, select = conditions_for_columns)

# we keep all columns, as we aren't using select
my_trial <– subset(all_trials, location == "Christchurch")

It is way clearer and pipe ready, as the first argument is the data frame name!

This code can easily be expanded to more complex conditions; for example to include all trees from Christchurch and (&)that are also taller than 10 m:

my_trial <– subset(all_trials, location == "Christchurch" & height > 10)

The dataset contains multiple variables but we only want to keep, say, location, block, height and diameter:

my_trial <– subset(all_trials, location == "Christchurch" & height > 10,
                   select = c(location, block, height, diameter))

with() and pipes

Another one. In the tidyverse functions are designed to receive the name of the data frame as the first argument, as in some_function(data = ..., other arguments). Most of the time in base R data is not the first argument and, in some cases, the functions do not take data = ... as an argument. The first case is not a problems, unless we want to use the base pipe |>. The second leads to either going for $ notation or, god helps us, using attach() to make our variables global. Note: never do this.

Argh! What to do? Here is where with() comes to life, being very useful for these two problematic cases. In essence, with(data_frame, function) is saying "look for the function arguments in the specified data_frame".

For example, this blog post gives a lengthy comparison of the %>% and |> pipes but, in my opinion, it complicates things a lot because is missing the use of with(). The post starts "When I am feeling lazy, I use base R for quick plots plot(mtcars$hp, mtcars$mpg)".

As a start, if I were feeling lazy I would've used plot(mpg ~ hp, mtcars), highlighting that the plot function already takes the data argument. In fact, I'm using it as plot(formula, data). If I needed data in the first place I could have simply used with(), which defaults to a data frame as the first argument:

mtcars |> with(plot(mpg ~ hp))

# This is simply calling
with(mtcars, plot(mpg ~ hp))

Instead, the author chooses to use anonymous (lambda) functions, which do have their place in R, but ends up with nasty looking code:

mtcars |> (\(x) plot(x$mpg ~ x$hp))()
# vs
mtcars |> with(plot(mpg ~ hp))
# or even
mtcars |> with(plot(hp, mpg))

I'm partial to using a formula in plot because I can easily visualise the underlying model in my head.

Some functions, mean() for example, don't take a data frame argument. Again, with() is your friend.

# This produces an error
mtcars |> mean(mpg)

# while this one works
mtcars |> with(mean(mpg))

Both within() (used in part 1) and with() will make your base code mucho moar readable (pun intended) and pipe ready.

Some love for Base R. Part 1

For a long time it has bothered me when people look down at base-R (meaning the set of functions that comes in a default installation), as it were a lesser version of the language when compared to the tidyverse set of functions or data.table or whatever. I think part of this situation is due to 1. much less abundant modern documentation and tutorials using base and 2. the treatment of base in those tutorials ignores base constructs that make it look, well, tidy.

This could be read as a curmudgeon complaining at clouds BUT there are occasions when relying on a minimal installation without any extra packages is quite useful. For example, when you want almost maintenance-free code between versions, base R is very stable. I have code that’s almost 20 years old and still running. As a researcher, this is a really good situation.

If you’re unconvinced, there is a much cooler application webr, or R in the browser. It happens to be that loading additional packages (particularly large ones, like the tidyverse) makes the whole R in the browser proposition much slower. A toot (this was in Mastodon, by the way) by bOb Rudis (hrbrmstr) got me thinking that:

@hrbrmstr It would be cool to see a renaissance of base R, leading to a tidier, much more readable use with with(), within(), etc.


Exchanging ideas on webR: packages are a pain, perhaps a base-R renaissance.
Exchanging ideas on webR: packages are a pain, perhaps a base-R renaissance.

Perhaps an example will show better what I mean. Let’s assume we go back in time and you find 15-year-old code that looks like this:

gw <- read.csv("data20091111.csv", header = TRUE)

gw$dratio <- gw$dia1.mm/gw$dia2.mm
gw$dia <- ifelse(is.na(gw$dia2.mm), gw$dia1.mm, (gw$dia1.mm + gw$dia2.mm)/2)
gw$slend <- gw$ht.cm/gw$dia
gw$smoe <- 1.1*gw$v2104^2 # Squared standing velocity in April
gw$gmoe <- 1.1*gw$vgreen^2
gw$dmoe <- (gw$bden/1000)*1.14*gw$vdry^2

Which, let’s face it, it doesn’t look pretty. If you were working in the tidyverse you’d probably also be using RStudio (and projects). If you are using projects, your code would look like:


gw <- read_csv("data20091111.csv")

gw <- gw %>% 
  mutate(dratio = dia1.mm/dia2.mm,
         dia = ifelse(is.na(dia2.mm), dia1.mm, (dia1.mm + dia2.mm)/2),
         slend = ht.cm / dia,
         smoe = 1.1 * v2104^2,
         gmoe = 1.1 * vgreen^2,
         dmoe = (bden / 1000) * 1.14 * vdry^2)

which is easier to read and follow in my opinion. However, there was nothing stopping you to write the following 15 years ago:

gw <- read.csv("data20091111.csv", header = TRUE)

gw <- within(gw, { 
  dratio<- dia1.mm/dia2.mm
  dia <- ifelse(is.na(dia2.mm), dia1.mm, (dia1.mm + dia2.mm)/2)
  slend <- ht.cm / dia
  smoe <- 1.1 * v2104^2
  gmoe <- 1.1 * vgreen^2
  dmoe <- (bden / 1000) * 1.14 * vdry^2

which is quite similar to the dplyr code, but without any external package dependencies. Now, if you are missing the magrittr pipes, from R 4.1.0 it is possible to use native pipes and write the above code as:

gw <- read.csv("data20091111.csv", header = TRUE)

gw <- gw |> within({ 
  dratio <- dia1.mm/dia2.mm
  dia <- ifelse(is.na(dia2.mm), dia1.mm, (dia1.mm + dia2.mm)/2)
  slend <- ht.cm / dia
  smoe <- 1.1 * v2104^2
  gmoe <- 1.1 * vgreen^2
  dmoe <- (bden / 1000) * 1.14 * vdry^2

which gets us even closer to the tidyverse. The real magic is using within() to specify inside which data frame we are looking for all the variables that are being referred by the calculations. This permits us writing dratio <- dia1.mm/dia2.mm instead of gw$dratio <- gw$dia1.mm/gw$dia2.mm. There are a few “tricks” like this to make base-R a very attractive option, particularly if you like minimal, few dependencies coding.

P.D. 2023-03-26: A couple of people asked in Mastodon Why not use transform() instead of within()? It is a good question, because transform() looks closer to mutate() with a call like:

transform(data_frame, transformation_1, transformation_2, etc)

But there is a subtle difference that creates an error in my previous example. In transform one cannot refer to variables previously created in the same transformation. Therefore, this fails:

gw <- read.csv("data20091111.csv", header = TRUE)

gw <- gw |> 
  transform(dratio = dia1.mm/dia2.mm,
            dia = ifelse(is.na(dia2.mm), dia1.mm, (dia1.mm + dia2.mm)/2),
            slend = ht.cm / dia,
            smoe = 1.1 * v2104^2,
            gmoe = 1.1 * vgreen^2,
            dmoe = (bden / 1000) * 1.14 * vdry^2)

because slend = ht.cm / dia refers to the variable dia created in the previous line. However, this could be fixed by using:

gw <- read.csv("data20091111.csv", header = TRUE)

gw <- gw |> 
  transform(dratio = dia1.mm/dia2.mm,
            dia = ifelse(is.na(dia2.mm), dia1.mm, (dia1.mm + dia2.mm)/2)) |>
  transform(slend = ht.cm / dia,
            smoe = 1.1 * v2104^2,
            gmoe = 1.1 * vgreen^2,
            dmoe = (bden / 1000) * 1.14 * vdry^2)

There are parts 2, 3 and 4 of this post.

« Older posts

© 2024 Palimpsest

Theme by Anders NorenUp ↑