Evolving notes, images and sounds by Luis Apiolaza


Some love for Base R. Part 1

For a long time it has bothered me when people look down on base R (meaning the set of functions that comes with a default installation), as if it were a lesser version of the language compared to the tidyverse set of functions, or data.table, or whatever. I think part of this situation is due to 1. modern documentation and tutorials using base being much less abundant and 2. the treatment of base in those tutorials ignoring the base constructs that make it look, well, tidy.

This could be read as a curmudgeon complaining at clouds, BUT there are occasions when relying on a minimal installation without any extra packages is quite useful. For example, when you want almost maintenance-free code between versions: base R is very stable, and I have code that’s almost 20 years old and still running. As a researcher, this is a really good situation.

If you’re unconvinced, there is a much cooler application: webR, or R in the browser. It happens that loading additional packages (particularly large ones, like the tidyverse) makes the whole R-in-the-browser proposition much slower. A toot (this was in Mastodon, by the way) by boB Rudis (hrbrmstr) got me thinking:

@hrbrmstr It would be cool to see a renaissance of base R, leading to a tidier, much more readable use with with(), within(), etc.

Luis

Exchanging ideas on webR: packages are a pain, perhaps a base-R renaissance.

Perhaps an example will show better what I mean. Let’s assume you go back in time and find 15-year-old code that looks like this:

setwd("/Users/luis/Documents/Research/2008/Glasshouse")
gw <- read.csv("data20091111.csv", header = TRUE)

gw$dratio <- gw$dia1.mm/gw$dia2.mm
gw$dia <- ifelse(is.na(gw$dia2.mm), gw$dia1.mm, (gw$dia1.mm + gw$dia2.mm)/2)
gw$slend <- gw$ht.cm/gw$dia
gw$smoe <- 1.1*gw$v2104^2 # Squared standing velocity in April
gw$gmoe <- 1.1*gw$vgreen^2
gw$dmoe <- (gw$bden/1000)*1.14*gw$vdry^2

Which, let’s face it, doesn’t look pretty. If you were working in the tidyverse you’d probably also be using RStudio (and projects), in which case your code would look like:

library(readr)
library(dplyr)

gw <- read_csv("data20091111.csv")

gw <- gw %>% 
  mutate(dratio = dia1.mm/dia2.mm,
         dia = ifelse(is.na(dia2.mm), dia1.mm, (dia1.mm + dia2.mm)/2),
         slend = ht.cm / dia,
         smoe = 1.1 * v2104^2,
         gmoe = 1.1 * vgreen^2,
         dmoe = (bden / 1000) * 1.14 * vdry^2)

which, in my opinion, is easier to read and follow. However, there was nothing stopping you from writing the following 15 years ago:

setwd("/Users/luis/Documents/Research/2008/Glasshouse")
gw <- read.csv("data20091111.csv", header = TRUE)

gw <- within(gw, { 
  dratio <- dia1.mm/dia2.mm
  dia <- ifelse(is.na(dia2.mm), dia1.mm, (dia1.mm + dia2.mm)/2)
  slend <- ht.cm / dia
  smoe <- 1.1 * v2104^2
  gmoe <- 1.1 * vgreen^2
  dmoe <- (bden / 1000) * 1.14 * vdry^2
})

which is quite similar to the dplyr code, but without any external package dependencies. Now, if you are missing the magrittr pipe, from R 4.1.0 it is possible to use the native pipe and write the above code as:

setwd("/Users/luis/Documents/Research/2008/Glasshouse")
gw <- read.csv("data20091111.csv", header = TRUE)

gw <- gw |> within({ 
  dratio <- dia1.mm/dia2.mm
  dia <- ifelse(is.na(dia2.mm), dia1.mm, (dia1.mm + dia2.mm)/2)
  slend <- ht.cm / dia
  smoe <- 1.1 * v2104^2
  gmoe <- 1.1 * vgreen^2
  dmoe <- (bden / 1000) * 1.14 * vdry^2
})

which gets us even closer to the tidyverse. The real magic is using within() to specify the data frame inside which we look for all the variables referred to by the calculations. This permits us to write dratio <- dia1.mm/dia2.mm instead of gw$dratio <- gw$dia1.mm/gw$dia2.mm. There are a few “tricks” like this that make base R a very attractive option, particularly if you like minimal coding with few dependencies.
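A related trick: where within() modifies a data frame, with() evaluates a single expression inside one, which is handy for quick summaries. A small example of my own (not from the original analysis), reusing the variables created above:

# Mean slenderness without any gw$ prefixes; assumes gw as computed above
with(gw, mean(slend, na.rm = TRUE))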

P.S. 2023-03-26: A couple of people asked on Mastodon: why not use transform() instead of within()? It is a good question, because transform() looks closer to mutate(), with a call like:

transform(data_frame, transformation_1, transformation_2, ...)

But there is a subtle difference that creates an error in my previous example: in transform() one cannot refer to variables created earlier in the same call. Therefore, this fails:

setwd("/Users/luis/Documents/Research/2008/Glasshouse")
gw <- read.csv("data20091111.csv", header = TRUE)

gw <- gw |> 
  transform(dratio = dia1.mm/dia2.mm,
            dia = ifelse(is.na(dia2.mm), dia1.mm, (dia1.mm + dia2.mm)/2),
            slend = ht.cm / dia,
            smoe = 1.1 * v2104^2,
            gmoe = 1.1 * vgreen^2,
            dmoe = (bden / 1000) * 1.14 * vdry^2)

because slend = ht.cm / dia refers to the variable dia created in the previous line. However, this can be fixed by chaining two calls:

setwd("/Users/luis/Documents/Research/2008/Glasshouse")
gw <- read.csv("data20091111.csv", header = TRUE)

gw <- gw |> 
  transform(dratio = dia1.mm/dia2.mm,
            dia = ifelse(is.na(dia2.mm), dia1.mm, (dia1.mm + dia2.mm)/2)) |>
  transform(slend = ht.cm / dia,
            smoe = 1.1 * v2104^2,
            gmoe = 1.1 * vgreen^2,
            dmoe = (bden / 1000) * 1.14 * vdry^2)

There are parts 2, 3 and 4 of this post.

Creating an n x n autocorrelation matrix

Between covid-19 news and announcements of an imminent Russia-Ukraine war I needed a bit of a distraction. Sooo, here is how to create an n x n autocorrelation matrix based on a correlation rho, with a simple 5 x 5 example in R:

n <- 5
rho <- 0.9
mat <- diag(n)
mat <- rho^abs(row(mat) - col(mat))

This produces the following output:

       [,1]  [,2] [,3]  [,4]   [,5]
[1,] 1.0000 0.900 0.81 0.729 0.6561
[2,] 0.9000 1.000 0.90 0.810 0.7290
[3,] 0.8100 0.900 1.00 0.900 0.8100
[4,] 0.7290 0.810 0.90 1.000 0.9000
[5,] 0.6561 0.729 0.81 0.900 1.0000

How does this work? We start by defining n as 5 and the basic correlation rho as 0.9; then mat <- diag(n) creates a diagonal matrix (a square matrix with ones on the diagonal and zeros elsewhere) with 5 rows and 5 columns.
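In case the indexing step looks mysterious: row(mat) and col(mat) return the row and column index of every cell, so abs(row(mat) - col(mat)) is each cell’s distance from the diagonal. A quick check with the same 5 x 5 case:

# Distance of each cell from the diagonal
abs(row(diag(5)) - col(diag(5)))
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    1    2    3    4
# [2,]    1    0    1    2    3
# [3,]    2    1    0    1    2
# [4,]    3    2    1    0    1
# [5,]    4    3    2    1    0

Raising rho to these powers, element by element, gives the autocorrelation matrix shown above.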


The beauty of code vectorisation

I came across this problem on Twitter:

Description of the problem by Fermat’s Library.

The basic pieces of the problem are:

  • We need to generate uniform pseudorandom numbers between 0 and 1, add them up until the sum goes over 1, and report back how many numbers we needed to achieve that condition.
  • Well, do the above “quite a few times” and take the average, which should converge to the number e (2.7182…). A vectorised sketch follows this list.
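The full solution is behind the fold, but here is a minimal vectorised sketch of my own, assuming a fixed, generous number of draws per replicate (both numbers below are arbitrary choices):

n_reps  <- 1e5  # replicates
n_draws <- 25   # draws per replicate; all 25 summing below 1 has probability 1/25!, negligible
u <- matrix(runif(n_reps * n_draws), nrow = n_reps)
# Running sums via a matrix product: cell (i, k) holds the sum of the
# first k draws of replicate i, with no explicit loops
cums <- u %*% upper.tri(diag(n_draws), diag = TRUE)
# The first sum above 1 comes right after the last sum below 1
counts <- rowSums(cums < 1) + 1
mean(counts)  # should be close to exp(1) = 2.718282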

Recreational programming

I think programming, aka coding, is a fun activity. We are solving a problem subject to a set of constraints, which can be time, memory, quantity of code, language, etc. Besides being part of my work, coding is also a good distraction when done for its own sake.

In this type of situation I like to set myself additional constraints. For example, I try to use only base R or a very minimal set of features. Moreover, I prefer vectorised code, with no explicit loops. For toy problems this restriction doesn’t make much of a difference in execution time, but in real problems R (and Matlab) run much more slowly with non-vectorised code, as the toy contrast below suggests.
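A toy contrast of my own (the operation is arbitrary, chosen only to illustrate the two styles):

x <- rnorm(1e6)

# Non-vectorised: an explicit loop over every element
y <- numeric(length(x))
for (i in seq_along(x)) y[i] <- 2 * x[i] + 1

# Vectorised: one expression, with the looping done in compiled code
y2 <- 2 * x + 1
all.equal(y, y2)  # TRUE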


Fixing Rcpp warning in Mac OS

On Mac OS I was getting an annoying warning when compiling C++ code via Rcpp in R:

ld: warning: directory not found for option '-L/usr/local/gfortran/lib/gcc/x86_64-apple-darwin15/version_number'

Adding a file called Makevars in ~/.R, defining FLIBS as the actual location of gfortran on my machine, fixed the problem. In my case, Makevars contains only the following line:

FLIBS=-L/usr/local/gfortran/lib
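If you prefer to set this up from R itself, a one-off sketch (the library path is whatever your own gfortran installation uses, and note that this overwrites any existing Makevars):

# Create ~/.R/Makevars pointing FLIBS at gfortran's libraries
dir.create("~/.R", showWarnings = FALSE)
writeLines("FLIBS=-L/usr/local/gfortran/lib", "~/.R/Makevars")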

That’s it.
