R pitfall #3: friggin’ factors

I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using single letters as example names, he was baffled by the result of the following code:

lines <- factor(LETTERS)
# [1] A B C D E F G H...
# Levels: A B C D E F G H...

linesNA <- ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
#  [1]  1  2 NA  4  5  6 NA  8...

The factor has been converted to its integer codes, with no trace of the level names. Even forcing the result back into a factor loses the level names, because the new levels are built from those integers. Newbie frustration guaranteed!

linesNA <- factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
# [1] 1    2    <NA> 4    5    6    <NA> 8...
# Levels: 1 2 4 5 6 8...

Under the hood, factors are integer vectors (of class factor) with an associated character vector that describes the levels (see Patrick Burns's R Inferno PDF for details). We can work directly on the levels instead:

linesNA <- lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] <- NA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...
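The two parts of a factor mentioned above can be inspected with base R alone. This is a minimal sketch; `codes`, `labs` and `rebuilt` are names made up for illustration.

```r
# Look at the two parts of a factor: integer codes and level names.
lines <- factor(LETTERS)

typeof(lines)                # storage type of the underlying vector
class(lines)                 # the class attribute that makes it a factor

codes <- as.integer(lines)   # the integer codes, one per element
labs  <- levels(lines)       # the character vector of level names

# Indexing the level names by the codes rebuilds the labels, which is
# essentially what as.character() does for a factor.
rebuilt <- labs[codes]
identical(rebuilt, as.character(lines))
```

Functions that ignore the class and the levels attribute, as ifelse() does here, only see the codes.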

The copy linesNA is there only to stay consistent with the earlier examples; we could operate directly on lines. Another way of doing the same would be to convert the factor to character before it reaches ifelse():

linesNA <- factor(ifelse(lines %in% c('C', 'G', 'P'), NA, as.character(lines)))
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...
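A quick way to gain confidence in these alternatives is to check that a level-based route and a character-based route give the same values and the same levels. The sketch below does that. It also tries a third route through `is.na<-` and `droplevels()`, which is not from the original discussion. The names `toNA`, `byLevels`, `byChar` and `byIsNA` are made up for this example.

```r
lines <- factor(LETTERS)
toNA  <- c('C', 'G', 'P')

# Route 1: set the unwanted levels to NA.
byLevels <- lines
levels(byLevels)[levels(byLevels) %in% toNA] <- NA

# Route 2: work on the character labels, then rebuild the factor.
byChar <- factor(ifelse(lines %in% toNA, NA, as.character(lines)))

# Route 3 (an alternative, not from the post): mark the elements as NA,
# then discard the levels that are no longer used.
byIsNA <- lines
is.na(byIsNA) <- lines %in% toNA
byIsNA <- droplevels(byIsNA)

# Compare the labels element by element, and the level sets.
same_values <- identical(as.character(byLevels), as.character(byChar)) &&
  identical(as.character(byLevels), as.character(byIsNA))
same_levels <- identical(levels(byLevels), levels(byChar)) &&
  identical(levels(byLevels), levels(byIsNA))
```

Without the `droplevels()` step, route 3 would keep C, G and P as unused levels, which may or may not be what you want.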

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).


13 responses to “R pitfall #3: friggin’ factors”

  1. Maybe this is more intuitive (although it is not really different from your approach):

    lines <- factor(LETTERS)

    linesNA <- lines

    levels(linesNA) <- ifelse(levels(linesNA) %in% c('C', 'G', 'P'), NA, levels(linesNA))

  2. My latest complaint about factors is this:

    R> factor(letters[1:10])
    [1] a b c d e f g h i j
    Levels: a b c d e f g h i j
    R> nchar(factor(letters[1:10]))
    [1] 1 1 1 1 1 1 1 1 1 2

    • Wow! R is counting the number of characters of the internal numeric representation of levels. Devious and nightmarish to debug! I share your pain.

    • Thanks for pointing out the exact location. I very much like your writing in the Inferno!

  3. You know, I have an embarrassing pitfall with the simple function "save". I can't save an R object in a file, because it saves a character string of the object instead of the contents of the object! Then I tried saving the whole session, and it saves a vector of all the objects' names. :facepalm:
    The sad thing is I haven't solved it yet.

    P.S. In moments like this I wish I had formal training in R programming.

    • It should be straightforward; for example:

      a = c('a', 'b', 'c')
      save(a, file = 'whatever.Rdata')

      Note that save() takes the names of objects, bare or quoted, so save('a', file = 'whatever.Rdata') stores the same object a, not the string 'a'. I hope this helps, Luis.

      • Thanks to Rbloggers I solved this "easy" task. The problem is that I did something like this:

        > x.var <- rnorm(100)
        > save(x.var, file="foo")
        > rm(x.var)
        > something <- load("foo")
        > something
        [1] "x.var"

        My mistake was assigning the result of load() to a variable, which is not needed; the line load("foo") is enough. My line of thought was: "It is good to keep my models, temporary data and other stuff inside variables so I can interact with them later, so let's load the R object and put it inside a variable!"
        Maybe I had a weird line of thought…
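The behaviour in this thread can be reproduced in a few lines. load() restores the saved objects under their original names and returns only a character vector of those names. If the goal is to read a stored object into a variable of your choosing, the saveRDS()/readRDS() pair in base R is one option. This is a minimal sketch; the temporary file paths and the names `loaded` and `something` are assumptions for illustration.

```r
x.var <- rnorm(100)
original <- x.var                     # keep a copy for comparison

# save()/load(): objects come back under their own names.
rdata <- tempfile(fileext = ".RData")
save(x.var, file = rdata)
rm(x.var)
loaded <- load(rdata)                 # recreates x.var in the workspace
loaded                                # the names of the restored objects

# saveRDS()/readRDS(): one object, assigned to any name you like.
rds <- tempfile(fileext = ".rds")
saveRDS(x.var, file = rds)
something <- readRDS(rds)
identical(something, original)
```

With save()/load(), `get(loaded)` retrieves the restored object by name if you need it programmatically.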

  4. It seems that most of the issue here is the idea that a factor is both a vector of numeric codes and a set of accompanying labels. This is a powerful representation, but it needs to be taken into account when dealing with factors.

    So, when you want the character count of the labels, you have to tell R it is the labels you are thinking about…
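For the nchar() case discussed above, "telling R it is the labels" could look like the sketch below. As far as I know, current versions of R reject a factor passed to nchar() with an error instead of silently counting the codes, so the sketch only calls it on character vectors. `perElement` and `perLevel` are illustrative names.

```r
f <- factor(letters[1:10])

# Count characters of each element's label.
perElement <- nchar(as.character(f))

# Or count once per level, then expand to the elements via the codes.
perLevel <- nchar(levels(f))
expanded <- perLevel[as.integer(f)]

identical(perElement, expanded)
```

The second form does less work when a long factor has only a few levels.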


  5. When I first learned R (by myself) I ended up with so many factors, usually because of base R data.frame()’s automystically changing character vectors into factors. So frustrating! Simply because I didn’t know to use stringsAsFactors = FALSE. My solution was to convert the factors first thing with as.character().

    Then the dplyr package came along and it was a revelation in simplicity.

    • In the old days (this post is from 12 years ago), putting options(stringsAsFactors = FALSE) at the beginning of a script was the workaround for what is now the default. dplyr is a great package, but it’s possible to write great, clean code in base R as well. For example, look at this post and parts 2 & 3.
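A small base R sketch of the stringsAsFactors behaviour discussed in this thread. Passing the argument explicitly gives the same result whatever the default of the installed R version. The potato line names and the data frame names are made up for illustration.

```r
raw <- c('Desiree', 'Atlantic', 'Desiree')   # illustrative line names

# Ask explicitly for character or factor columns.
asText   <- data.frame(line = raw, stringsAsFactors = FALSE)
asFactor <- data.frame(line = raw, stringsAsFactors = TRUE)

is.character(asText$line)
is.factor(asFactor$line)

# Converting an existing factor column back to text.
asFactor$line <- as.character(asFactor$line)
identical(asFactor$line, raw)
```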