Evolving notes, images and sounds by Luis Apiolaza

There are several blog posts, websites (and even books) explaining the transition from using another statistical system (e.g. SAS, SPSS, Stata, etc.) to relying on R. Most of that material treats the topic from the point of view of i- an individual user and ii- a researcher. This post explains some of the issues involved in, first, moving several users at once and, second, doing so with an emphasis on teaching.

I have made part of this information available before, but I wanted to update it and keep it together with all the other posts in Quantum Forest. The process started in March 2009.

## March 2009

I started explaining to colleagues my position on using R (and R commander) for teaching purposes. Some background first: forestry deals with variability and variability is the province of statistics. The use of statistics permeates forestry: we use sampling for inventory purposes, we use all sorts of complex linear and non-linear regression models to predict growth, linear mixed models are the bread and butter of the analysis of experiments, etc.

I think it is fair to expect foresters to be at least acquainted with basic statistical tools, and we have two courses covering ANOVA and regression. In addition, we are supposed to introduce/reinforce statistical concepts in several other courses. So far so good, until we reached the issue of software.

During the first year of study, it is common to use MS Excel. I am not a big fan of Excel, but I can tolerate its use: people do not require much training to (ab)use it and it has a role in introducing students to some of the ‘serious/useful’ functions of a computer; that is, beyond gaming. However, one can hit Excel's limits fairly quickly; this, together with the lack of an audit trail for the analyses and the need to repeat all the pointing and clicking every time we need an analysis, makes looking for more robust tools very important.

Until the end of 2009 SAS (mostly BASE and STAT, with some sprinkles of GRAPH) was our robust tool. SAS was introduced in second year during the ANOVA and regression courses. SAS is a fine product, however:

• We spent a very long time explaining how to write simple SAS scripts. Students forgot the syntax very quickly.
• SAS’s graphical capabilities are fairly ordinary and not at all conducive to exploratory data analysis.
• SAS is extremely expensive.
• SAS tends to define the subject; I mean, it adopts new techniques very slowly, so there is the tendency to do only what SAS can do. This is unimportant for undergrads, but it is relevant for postgrads.
• Users sometimes store data in SAS’s own format, which introduces another source of lock-in.

At the time, in my research work I used mostly ASReml (for specialized genetic analyses) and R (for general work); since then I have moved towards using asreml-R (an R library that interfaces with ASReml) to have a consistent work environment. For teaching I was using SAS to be consistent with second-year material.

Considering the previously mentioned barriers for students, I started playing with R-commander (Rcmdr), a cross-platform GUI for R created by John Fox (the writer of some very nice statistics books, by the way). As I see it:

• R in command mode is not more difficult (but not simpler either) for students than SAS. I think that SAS is more consistent and they have worked hard at keeping a very similar structure between PROCs.
• We can get R-commander to start working right away with simple(r) methods, while maintaining the possibility of moving to more complex methods later by typing commands or programming.
• It is free, so our students can load it onto their laptops and keep on using it when they are gone. This is particularly relevant for international students: many of them will never see SAS again in their home countries.
• It allows an easy path to data exploration (pre-requisite for building decent models) and high quality graphs.
• R is open source (nice, but not a deal breaker for me) and easily extensible (this one is really important for me).

At the time I thought that R would be an excellent fit for teaching; nevertheless, there could be a few drawbacks, mostly when dealing with postgrads:

• There are restrictions on the size of datasets (they have to fit in memory), although there are ways to deal with some of the restrictions. On the other hand, I have hit the limits of PROC GLM and PROC MIXED before and that is where ASReml shines. In two years this has never been a problem.
• Some people have an investment in SAS and may not like the idea of using different software. This was a problem the first few months.

As someone put it many years ago, there is always resistance to change:

It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage, than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who would gain by the new ones.—Niccolò Machiavelli, The Prince, Chapter 6

## Five months later: August 2009

At the department level, I had to spend substantial time compiling information to prove that R could satisfy my colleagues’ statistical needs. Good selling points were nlme/lme4, lattice/ggplot2 and pointing my most statistically inclined colleagues to CRAN. Another important issue was the ability to have a GUI (Rcmdr) that could be adapted to our specific needs. At that time the School of Forestry adopted R as the default software for teaching any statistical content during the four years of the curriculum.

At the university level, my questions to the department of Mathematics and Statistics sparked a lot of internal discussion, which resulted in R being adopted as the standard software for the ANOVA and regression second-year courses (it was already the standard for many courses in 3rd and 4th year). The decision was not unanimous, particularly because for statisticians SAS is one of those ‘must be in the CV’ skills, but they went for change. The second-year courses are offered across colleges, which makes the change far-reaching. These changes implied that many computers in the university labs now come with R pre-installed.

## A year later: April 2010

R and R-commander were installed in our computer labs and we started using them in our Research Methods course. It is still too early to see what the effect of R versus SAS will be, but we expect to see an increase in the application of statistics within our curriculum.

One thing that I did not properly consider in the process was the annoying side-effects of the university’s computer policies. Students are not allowed to install software on the university computers and R packages fall within that category. We can either stay with the defaults + R commander (our current position) or introduce an additional complication for students, pushing them to define their own library location. I’d rather teach ggplot2 than lattice, but ggplot2 is an extra installation. Choices, choices… On the positive side, the default installation for some of the computer labs installs all the packages.
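For readers wondering what "defining their own library location" could look like in practice, here is a minimal sketch. It rests on a few assumptions: the directory name is hypothetical, and a temporary directory is used only so the example runs anywhere (a student would more likely pick a folder in their home or network drive). The actual package installation is left commented out because it needs network access.

```r
# Hypothetical per-user library directory (an assumption for illustration);
# any folder the student is allowed to write to would do.
user_lib <- file.path(tempdir(), "R-library")
dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)

# Put the personal library first in the search path for this session.
.libPaths(c(user_lib, .libPaths()))

# Packages would then be installed into, and loaded from, that directory:
# install.packages("ggplot2", lib = user_lib)
# library(ggplot2)

# Show where R will now look for packages.
.libPaths()
```

The call to `.libPaths()` only lasts for the current session; making it permanent usually means setting the `R_LIBS_USER` environment variable or adding the call to a startup file, which is exactly the extra complication for students mentioned above.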

## Two years later: March 2011

Comments after teaching a regression modeling course using R-commander:

• Some students really appreciate the possibility of using R-commander as their ‘total analysis system’. Most students who have never used a command line environment prefer it.
• Students who have some experience with command-line work do not like R-commander much, as they find it confusing, particularly because it is possible to access the R console through two points: Rcmdr and the default console. Some of them could not see the point of using an environment with a limited subset of functionality.
• Data transformation facilities in R-commander are somewhat limited to the simplest cases.
• Why is it that the linear regression item does not accept categorical predictors? That works under ‘linear models’, but it is such an arbitrary separation.
• The OS X version of R-commander (under X Windows) is butt ugly. This is not John Fox’s fault, but just a fact of life.
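To illustrate why the menu separation between ‘linear regression’ and ‘linear models’ feels arbitrary: at the console, the same `lm()` call handles numeric and categorical predictors alike. The sketch below uses simulated data with made-up variable names (`trees_df`, `species`, `age`, `height`); it is not taken from the course material.

```r
# Simulated data (made up for illustration): tree height by age and species.
set.seed(42)
trees_df <- data.frame(
  species = factor(rep(c("pine", "fir", "eucalypt"), each = 10)),
  age     = rep(1:10, times = 3)
)

# Arbitrary species effects used only to generate the fake response.
species_effect <- c(eucalypt = 3, fir = 0, pine = 1.5)
trees_df$height <- 2 + 0.8 * trees_df$age +
  unname(species_effect[as.character(trees_df$species)]) +
  rnorm(nrow(trees_df), sd = 0.5)

# One call fits a model with a numeric and a categorical predictor;
# R expands the factor into indicator (dummy) variables automatically.
fit <- lm(height ~ age + species, data = trees_df)
coef(fit)
```

With the default treatment contrasts the first factor level (alphabetically, `eucalypt`) becomes the baseline, so the fitted coefficients are an intercept, a slope for `age`, and offsets for `fir` and `pine`.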

In general, R would benefit from having a first-class Excel import system that worked across platforms. Yes, I know that some people say that researchers should not use Excel; however, there is a distinction between normative and positive approaches to research. People do use Excel and insisting that they should not is not helpful.

I would love to hear anyone else’s experiences teaching basic statistics with R. Any comments?

1. G. Grothendieck

There is info on R packages for Excel here: http://rwiki.sciviews.org/doku.php?id=tips:data-i

2. Etienne B. Racine

Interesting post. You should have a look at Deducer package. It offers a great interface to work with tabular data and has an ANOVA interface, among others.

3. Luis

Thanks G and Etienne for the recommendations. I'm looking for the right combination of packages that we can use for teaching, achieving the most functionality and ease of use with the smallest number of additions.

4. Bob Muenchen

Very interesting. We're struggling with the same transition at The University of Tennessee. R Commander offers the most plug-ins but Deducer is much more modern looking & its Plot Builder for ggplot2 graphs is really nice. Plus its integration into JGR gives programmers a much better environment than the standard one that comes with R. Installation can be problematic, but a new installer is on the way.

Red-R is the only GUI that allows you to rerun the program without calling up the R code itself, which is a plus. Revolution Analytics has SPSS' own GUI designer working on an R GUI that will be free for academia. It looks like in about six months we'll have a much better view of where GUIs are headed.

Cheers,
Bob Muenchen http://r4stats.com

• Luis

Hi Bob,

Thanks for stopping by. I remember having a quick look at Deducer a while ago but I didn't feel that it was ready to be used in a course at the time. I will have a look again, because I don't need to teach any stats until late February, so there is still some time for the tools to catch up with our requirements. A nice interface, even a commercial one like Revo but free for academic use, would be really good. An interesting trend though is that Macs are becoming very popular with students and Revolution Analytics may want to consider working beyond Windows-only software.

PS 2011-10-23. I just realized that http://www.red-r.org/ is the type of visual interface (Vista-like, which I mentioned in a comment below) I was hoping to see for R. Thanks for the suggestion.

5. xingmowang

I have tutored 100 level Stats, Math and 200 level Finance at VUW from 2006 to 2008. I used R to write all my tutorial answers — demonstration using graphs, stem plot in console and animation (lectures in those courses used pen and projector, lame right?). When I saw my evaluation sheet at the end of those tutorials, students gave me a very good score on preparation and on stimulating their interests.

I guess if lecturers tell undergraduate students (Stats, Econ, Finance majors) that R is THE tool in the future job market they may conquer the initial learning curve and dive into R. If using R is optional for undergrads, it must be compulsory for postgrads/honours. I was introduced to R in my honours year, thanks to my Quan and Finance lecturers. I have never stopped using it since. Even working at a ministry right now, I still constantly impress my manager and colleagues with what R can do. I am planning on giving R tuts to my team in the ministry and this idea is really positively perceived by my superiors.

Last, from a lecturer's point of view, I would definitely encourage students to use R because it is free; because you can just use your own PC to finish assignments instead of going to the school lab; because if you do that you can save some travel costs; because more and more ppl are using R. Damn there are countless reasons. I will stop here.

In terms of GUI, R + a free editor (with syntax highlighter) would be ideal. Just let students give up on clicking around and teach them that they need to document everything done on the data (do not be like some climate scientists, haha).

• Luis

I think it boils down to knowing your audience. I teach regression modeling, a second year course with a highly diverse student population. Between one-quarter and one-third of the students come from forestry (for whom the course is compulsory), while the rest may come from psychology, finance, engineering, science, maths and stats, you name it.

A very small proportion of the course will continue using stats on a regular basis, while the rest will use statistics only a few times in their professional life. In my opinion, both groups must become very familiar with the underlying concepts; however, the former should also be very familiar with a tool (in this case the R language), while the second group will probably be better served with the system of menus, for which there is no need to remember all the commands.

Personally I much prefer using environments like RStudio (or even the OS X version of R) to the combination text-editor/R. I'm not a big fan of, for example, emacs/ESS or Vim/R or Textmate/R, etc. In terms of GUI, I would love to see an implementation of something like Vista in which it was possible to visually document the analyses. Probably it was ahead of its time, but to me it makes more sense than a traditional point and click interface.

6. jianfeng.mao

I have benefited much from the statistics I learned in the Forestry department in college 11 years ago, and also from R, which I taught myself four years ago. I am glad to see such an enthusiastic teacher, who is engaged in teaching statistics and R to his students, especially students in Forestry. I am grateful to teachers who intend to improve their students' application of statistics. The following sentences reflect the heart of a statistics teacher. Thanks a lot.

Some background first: forestry deals with variability and variability is the province of statistics. The use of statistics permeates forestry: we use sampling for inventory purposes, we use all sorts of complex linear and non-linear regression models to predict growth, linear mixed models are the bread and butter of the analysis of experiments, etc.

I am now a biological researcher. Students who graduate from Forestry departments usually have robust training in both biology and statistics/math. The latter eases my life in biological research, with its many computational tasks.

7. Tom

I'll just make a quick plug for Rstudio. I just finished teaching a 1-day course in R basics and Rstudio made the whole idea of R so much more palatable to the participants. Something about an interface that looks good puts the mind at ease and makes learners think, okay, I can do this. As well, it provides the point-click ability for the initial commands that can trip up brand-new R users.

• Luis

Hi Tom,

Thanks for reminding me. I have been toying with the idea of using (for teaching purposes) Rstudio as the default "console" version for R plus Rcommander or an alternative as a menu driven version. I may write something about this in the future.

• BJM

Just wanted to say that I am learning to use R and looked around at many GUIs for different OSes and settled on RStudio as my preference. So, another +1 for RStudio.

8. Eric

I have found a wonderful site called stats4stem.org to teach statistics using R. The site has a variety of resources; however, one resource that I have found really useful is a web-based interface for R. It's called Rweb. I am currently using this site to teach statistics to regular high school students, and they love using R and the Rweb interface. One thing that students find helpful is that directly above the web interface there is a menu of tutorials on how to accomplish a variety of statistical objectives (graphing, linear regression, simulations, etc). Now all my students want to use R and Rweb during exams. I feel that by using this site, students are learning statistics while also learning R, a great skill that can be applied to a variety of fields. The Rweb interface has some nice features for students just learning R:

1) multiple graphs can be viewed and displayed at 1 time
2) coding is done as just a regular text file, students find editing far easier than with traditional R
3) it is web based so access can be attained with just a web browser
4) students have easy access to R tutorials directly above the programming window
5) a variety of libraries are pre-installed (you still have to type library(…))
6) the site also has a datasets section for easy access to pre-installed R datasets

• Luis

Thanks for the information Eric. I’m not a big fan of Rweb though, because it is unclear to me what problem it is solving from the point of view of teaching both statistics and R.