I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using simple letters as example names, he was baffled by the result of the following code:
lines <- factor(LETTERS)
lines
# [1] A B C D E F G H...
# Levels: A B C D E F G H...

linesNA <- ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
# [1] 1 2 NA 4 5 6 NA 8...
The factor has been converted to numeric, with no trace of the level names. Even coercing the result back to a factor loses them. Newbie frustration guaranteed!
linesNA <- factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1 2 <NA> 4 5 6 <NA> 8...
# Levels: 1 2 4 5 6 8...
Under the hood, factors are integer vectors (of class factor) with an associated character vector that describes the levels (see Patrick Burns's R Inferno PDF for details). ifelse() works on the integer codes and drops the levels, which explains the results above. We can deal directly with the levels using this:
linesNA <- lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] <- NA
linesNA
# [1] A B <NA> D E F <NA> H...
# Levels: A B D E F H...
We could operate directly on lines without creating linesNA; the copy is there only to maintain consistency with the previous code. Another way of doing the same would be:
linesNA <- factor(ifelse(lines %in% c('C', 'G', 'P'), NA, as.character(lines)))
linesNA
# [1] A B <NA> D E F <NA> H...
# Levels: A B D E F H...
I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).
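To see the two pieces of a factor side by side, here is a minimal sketch using only base R. The object names res and labels_back are made up for illustration; the rest follows the example above.

```r
lines <- factor(LETTERS)

# The integer codes stored inside the factor
as.integer(lines)[1:4]   # 1 2 3 4

# The character labels those codes point to
levels(lines)[1:4]       # "A" "B" "C" "D"

# ifelse() builds its result from the bare codes, so class and levels are lost
res <- ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
class(res)               # "integer"

# Indexing the levels by the codes recovers the labels
labels_back <- levels(lines)[as.integer(lines)]
identical(labels_back, as.character(lines))  # TRUE
```

This is why converting to character before calling ifelse() keeps the names: the labels are looked up while the factor is still intact.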
13 responses to “R pitfall #3: friggin’ factors”
Maybe this is more intuitive (although it is not really different from your approach):
lines <- factor(LETTERS)
linesNA <- lines
levels(linesNA) <- ifelse(levels(linesNA) %in% c('C', 'G', 'P'), NA, levels(lines))
You can do:
lines[lines %in% c('C', 'G', 'P')] <- NA
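One difference between the approaches is worth checking: setting elements to NA keeps the levels defined, while setting levels to NA removes them. A small sketch, where to_na, by_element and by_level are names invented for this comparison:

```r
lines <- factor(LETTERS)
to_na <- c('C', 'G', 'P')

# Option 1: blank out the elements; the levels C, G and P stay defined
by_element <- lines
by_element[by_element %in% to_na] <- NA
nlevels(by_element)              # 26

# Option 2: blank out the levels; they disappear from the factor
by_level <- lines
levels(by_level)[levels(by_level) %in% to_na] <- NA
nlevels(by_level)                # 23

# droplevels() removes the unused levels left behind by option 1
nlevels(droplevels(by_element))  # 23
```

Which one is preferable probably depends on whether the empty levels should still show up later, for example in table() output.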
My latest complaint about factors is this:
 [1] a b c d e f g h i j
Levels: a b c d e f g h i j
 [1] 1 1 1 1 1 1 1 1 1 2
Wow! R is counting the number of characters of the internal numeric representation of levels. Devious and nightmarish to debug! I share your pain.
Several stumbling blocks with factors are shown at the beginning of Circle 8.2 of 'The R Inferno' http://www.burns-stat.com/pages/Tutor/R_inferno.p…
Thanks for pointing out the exact location. I like very much your writing in the Inferno!
You know, I have an embarrassing pitfall with the simple function "save". I can't save an R object in a file, because it saves a character string of the object instead of the contents of the object! Then I tried saving the whole session, and it saves a vector of all the objects' names. :facepalm:
The sad thing is I haven't solved it yet.
P.S. It is in moments like this that I wish I had formal training in R programming.
It should be straightforward; for example:
a = c('a', 'b', 'c')
save(a, file = 'whatever.Rdata')
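Putting the object name between quotes, as in save('a', file = 'whatever.Rdata'), saves the same object, because save() accepts object names as symbols or as character strings. If you still get a name back instead of the contents, the problem is probably somewhere else in the code. I hope this helps, Luis.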
Thanks to Rbloggers I solved this "easy" task. The problem is that I did something like this:
> x.var <- rnorm(100)
> save(x.var, file="foo")
> something <- load("foo")
My mistake was assigning the result to a variable, which is not needed; the line load("foo") is enough. My line of thought was: "It is good to keep my models, temporary data and other stuff inside variables so I can interact with them later, so let's load the R object and put it inside a variable!"
Maybe I had a weird line of thought…
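For anyone hitting the same wall, here is a small sketch of what happens. The temporary file paths and the names original and restored are only for illustration. load() puts the saved objects back under their own names and returns just those names, while saveRDS() and readRDS() handle a single object whose value can be assigned.

```r
x.var <- rnorm(100)
path <- tempfile(fileext = ".RData")   # temporary file, used only for this sketch

save(x.var, file = path)
original <- x.var
rm(x.var)

# load() restores x.var into the workspace and returns only the object names
something <- load(path)
something                      # "x.var"
identical(x.var, original)     # TRUE

# saveRDS()/readRDS() store one object, so the result can be assigned
rds_path <- tempfile(fileext = ".rds")
saveRDS(original, rds_path)
restored <- readRDS(rds_path)
identical(restored, original)  # TRUE
```

So the instinct of keeping things in variables is fine; it simply matches readRDS() rather than load().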
It seems that most of the issue here is the idea that a factor is both a vector of integer codes and a set of accompanying labels. This is a powerful representation, but it needs to be taken into account when dealing with the factor structure.
So, when you want the character count of the labels, you have to tell R it is the labels you are thinking about…
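A quick sketch of the distinction, assuming the complaint above came from calling nchar() on a factor of the first ten letters (the original code is not shown, and as far as I know recent versions of R refuse nchar() on a factor outright):

```r
f <- factor(letters[1:10])

# Counting characters of the integer codes 1..10: the code 10 has two digits
nchar(as.integer(f))    # 1 1 1 1 1 1 1 1 1 2

# Counting characters of the labels: convert to character first
nchar(as.character(f))  # 1 1 1 1 1 1 1 1 1 1
```

Being explicit with as.character() or as.integer() makes it clear which of the two representations is being measured.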
I manually posted your comment. It seems that it was trapped in the system just while I was doing the transition between Intense Debate and WordPress’s default system and ended up nowhere to be seen.
When I first learned R (by myself) I had so many factors created, usually because of the base R data.frame()’s automystically changing character vectors into a factor type. So frustrating! Simply because I didn’t know to use stringsAsFactors = FALSE. My solution was to convert the factors with as.character() first thing.
Then the dplyr package came along and it was a revelation in simplicity.
In the old times (this post is from 12 years ago), options(stringsAsFactors = FALSE) at the beginning of a script was a solution to what now is the default.
dplyr is a great package, but it’s possible to write great, clean code in base R as well. For example, look at this post and parts 2 & 3.
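As a small illustration of the stringsAsFactors point, the sketch below passes the argument explicitly so the result does not depend on the R version (the default changed to FALSE in R 4.0.0). The data frame and column names are made up.

```r
# Character column kept as character
df_chr <- data.frame(line = c('A', 'B', 'C'), stringsAsFactors = FALSE)
class(df_chr$line)   # "character"

# Old default behavior: character column converted to a factor
df_fct <- data.frame(line = c('A', 'B', 'C'), stringsAsFactors = TRUE)
class(df_fct$line)   # "factor"

# Converting after the fact, as described in the comment above
df_fct$line <- as.character(df_fct$line)
class(df_fct$line)   # "character"
```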