This week I’ve been feeling tired of the excessive fanaticism (or zealotry) around open source software (OSS) and R in general. I use a fair amount of OSS and pushed for the adoption of R in our courses; in fact, I do think OSS is a Good Thing™. I do not like, however, the constant yabbering about why using exclusively OSS in science is a good idea, or the reduction of science to repeatability‡ and computability (both of which I covered in my previous post). I also dislike the snobbery of ‘you shall use R and not Excel at all, because the latter is evil’ (going back ages).
We often have several experiments running during the year and most of the time we do not bother setting up a database to keep the data. Doing that would essentially mean that I would have to do it, and I have a few more important things to do. Therefore, many data sets end up in… (drum roll here) Microsoft Excel.
How should a researcher set up data in Excel? Rather than reinventing the wheel, I’ll use an (im)perfect diagram that I found years ago in a Genstat manual.
I like it because:
- It makes clear how to set up the experimental and/or sampling structure; one can handle any design with enough columns.
- It also manages any number of traits assessed in the experimental units.
- It contains metadata in the first few rows, which can be easily skipped when reading the file. I normally convert Excel files to text and then skip the first few lines (using `skip` in R or `firstobs` in SAS).
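As a minimal sketch of the `skip` approach in R (the file and column names here are made up for illustration): a few metadata lines sit at the top of the text export, and `read.table()` jumps over them before reading the header.

```r
# Write a small tab-delimited file with three metadata lines at the top,
# then read it back skipping those lines (file and column names are made up)
meta <- c("Trial: spacing experiment", "Site: somewhere", "Assessor: LA")
writeLines(c(meta, "tree\tdbh", "1\t15.2", "2\t17.8"), "trial.txt")

# skip = 3 jumps the metadata; header = TRUE then picks up the column names
trial <- read.table("trial.txt", header = TRUE, sep = "\t", skip = 3)
```

The metadata stay with the data file for humans to read, while the analysis script ignores them.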
People doing data analysis often start convulsing at the mention of Excel; personally, I deeply dislike it for analyses, but it makes data entry very easy, and even a monkey can understand how to use it (I’ve seen them typing, I swear). The secret to sane use is to restrict Excel to data entry; any data manipulation (subsetting, merging, derived variables, etc.) or analysis is done in statistical software (I use either R or SAS for general statistics, and ASReml for quantitative genetics).
It is far from a perfect solution, but it fits in the realm of the possible and, considering all my work responsibilities, it’s a reasonable use of my time. Could someone make a weird change in the spreadsheet? Yes. Could you fart while moving the mouse and create a non-obvious side effect? Yes, I guess so. Will it make your life easier, and make it possible to complete your research projects? Yes sir!
P.S. One could even save data using a text-based format (e.g. csv, tab-delimited) and use Excel only as a front-end for data entry. Other spreadsheets are of course equally useful.
P.S.2. Some of my data are machine-generated (e.g. by acoustic scanners and NIR spectroscopy) and get dumped by the machine into a separate text file for each sample (usually very wide; for example, 2,000 columns). I never put them in Excel, but read them directly (a directory-full of them) into R for manipulation and analysis.
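A sketch of reading a directory-full of machine-generated files into R (the directory, file names and dimensions are invented here; real scanner output would already exist on disk):

```r
# Fake two very wide one-row text files, one per sample (paths illustrative)
dir.create("scans", showWarnings = FALSE)
write(paste(rnorm(20), collapse = " "), "scans/sample1.txt")
write(paste(rnorm(20), collapse = " "), "scans/sample2.txt")

# List every .txt file in the directory, read each one, and stack the rows
files   <- list.files("scans", pattern = "\\.txt$", full.names = TRUE)
spectra <- do.call(rbind, lapply(files, read.table, header = FALSE))
```

One `lapply()` over `list.files()` replaces any amount of copy-pasting into a spreadsheet, and scales to thousands of files.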
‡As an interesting aside, the post A summary of the evidence that most published research is false provides a good summary of the need to freak out about repeatability.
I’m definitely the R zealot in my group and in the subject that I teach. It’s at the point where people have suggested I get a T-shirt printed up that says “You weren’t born knowing how to walk, so you might as well learn how to use R” (a combination of two phrases I tend to repeat to first year students).
I really don’t have a problem with people using Excel to do data entry, manipulation, storage of meta-data, etc. I think it’s completely unreasonable to expect that everyone stores their data in long format in a CSV file or that everyone sets up a database for every single project they work on and we use read.table and reshape to do any manipulation.
I do have a huge problem when people use Excel inconsistently and non-transparently to do analysis that isn’t properly documented in the paper or meta-data. That’s the point at which I need to sit down with them and figure out what their Excel sheet is doing so I can save the data somewhere and rewrite the analysis as an R script so that I can generate the plots they want me to produce and make any changes to the analysis/data without having to drag and drop formulae everywhere in an XLSX file.
Hi Sam,
Personally, I do not accept any data analysis in Excel in our research team. Zero. I have two fundamental reasons: first, I don’t trust Excel (or any other spreadsheet, for that matter) as an analytic tool; second, it limits the options for analyses tremendously. Our data sets are fairly expensive to obtain and I will use any technique that permits me to wring the last bit of information from them; this will be sometimes R and sometimes other software, open- or closed-source.
Students can play with the odd plot, but I can’t have tens or hundreds of thousands of dollars riding on a linear regression done in Excel. Furthermore, our experiments were designed in such a way that mixed models are a necessity.
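For the curious, this is roughly what a minimal mixed model looks like in R. This sketch uses `nlme` (which ships with R as a recommended package) rather than the ASReml mentioned in the post, and the data, variable names and design are all simulated for illustration:

```r
library(nlme)  # recommended package, distributed with R
set.seed(1)

# Simulated randomized-block data: 5 blocks x 4 families (names invented)
trial <- expand.grid(block = factor(1:5), family = factor(1:4))
trial$height <- 10 + as.numeric(trial$family) + rnorm(nrow(trial))

# Family as fixed effect, block as random effect:
# the kind of model a spreadsheet cannot fit
m <- lme(height ~ family, random = ~ 1 | block, data = trial)
```

Separating fixed and random effects like this is exactly what is outside the reach of a spreadsheet’s built-in regression tools.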
Big data sets will eventually end up in a database, though.
I appreciate the post and the tone. Frankly, I don’t think data entry in Excel is that great – it tends to be error-prone, though you can set up validation rules and so on, which helps… but if you are working that hard already, maybe other tools are better. However, Excel can be good for very small amounts of data, or for taking a quick look and doing a few calculations – perhaps finding out how many non-zero values are in each column, a histogram, things like that.
It is also brilliant for its own homegrown tasks (which are not data analysis – balance sheets, budgets, etc).
All of that and more can be done in JSON, though (see http://timotheepoisot.fr/2013/12/02/json-goto-format/). I agree with Excel for data entry if there is no other way, but I would never use it for more than that…
I’m OK with JSON. Now, how do you type your field data in JSON?
YAML, for example. There is nothing easier to type than that. Or you spend 30 minutes writing a simple django app sitting in the field laptop, and the data go straight into a db. My point is: you can always avoid Excel.
‘Or you spend 30 minutes writing a simple django app sitting in the field laptop.’ Here is where the normal user departs. What’s django? Do you write that? What? A text editor? With a snake in it? Sure, you can always avoid Excel, but you are not the target audience for this post.
And me not being the target audience for this post is why I care so much about training and helping people develop (what I consider) better practices. We are not supposed to be satisfied with the status quo…
Hi Tim,
It isn’t possible to nest comments anymore, so I continue here. I am not satisfied with the status quo, but I am working to expand the understanding of statistics to more people, in a different way than you are. I do not think statistics is about code at all; code and software are just tools to facilitate our understanding or to expand our capabilities to tackle (too) numerically intensive problems.
Some people have the natural ability, infinite patience (or both) to do what you like doing. Most people don’t. I like making stats accessible to the latter group, because I think it is bigger and they deserve better. That’s it.
The dogs bark and the caravan moves on.
Most scientists use Excel every day; fewer use R,
so the majority of PhDs have never heard of R and never will. And you know something? They are doing great work, and not wasting their time learning computer programs that don’t help.
Sure, a few have data sets that require R, and you know? I have no problem with that; have fun; knock yourselves out. Just don’t get all preachy on me about how I have to use a CNC 5-axis milling machine with an auto tool changer to drill a hole.
Fair enough. My experiments require linear mixed models for the analysis, so Excel can’t be used.
While I’d typically start off by saying that fanaticism in general is counterproductive, if only because it ceases to be rational, I do realize that but for the fanatics (or, as some might more politely say, the passionistas) no great advancements in R’s own interface, functions and usability would happen… It’s only because other programs do some things so much better than R that the acute need to fill yawning gaps ever gets acted upon… but I repeat myself.
Yup, I’m an R hardliner who doesn’t mind Excel at all, just not for my work whether inside or outside the classroom.
R and Excel can work well together in research and in business. Sometimes it’s nice to fiddle around with numbers in Excel while coding up algorithms in R or other languages. On the business side, in financial services, Excel is not only used for its spreadsheet functionality but also as a dashboard to visualize, model and interface with wider systems via specialized add-ins, for example a real-time data feed via Bloomberg, or a connection to a custom compute server. There is also XLConnect to create spreadsheets directly from R (and to read data from them), or RExcel to call R directly from within Excel.
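A hedged sketch of the XLConnect round trip (this assumes the XLConnect package and a Java runtime are installed; the data frame, file name and sheet name are all made up for illustration):

```r
library(XLConnect)  # assumes the package and a Java runtime are available

# A toy data frame standing in for real results
results <- data.frame(plot = 1:3, mean_height = c(10.2, 11.5, 9.8))

# Write the data frame to a named sheet, then read it back from the file
writeWorksheetToFile("results.xlsx", data = results, sheet = "summary")
back <- readWorksheetFromFile("results.xlsx", sheet = "summary")
```

This keeps the analysis in R while still handing colleagues a spreadsheet they can open.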
Actually, I do think Excel is too expensive, and the newer versions are cluttered with counterintuitive menu options, while the OpenOffice and LibreOffice alternatives have their own shortcomings. So there is an ongoing love-hate relationship with Excel that is unlikely to be resolved any time soon.
I know where you’re coming from, but I think .csv, never .xls, because of things like this:
https://coffeehouse.dataone.org/2014/04/09/abandon-all-hope-ye-who-enter-dates-in-excel/
Excel’s issues with dates are well known (and documented). Moving dates between different programs without checking is madness, as dates are a very complex thing to represent. Having said that, I do not use dates in Excel, because I do not deal with date data in my analyses.