David Smith at the Revolutions blog posted a nice presentation on “big data” (oh, how I dislike that term). It is an impressive piece of work: the Revolution guys managed to process a large number of records, starting with a download of 70GB and ending up with a series of linear regressions.
I’ve spent the last two weeks traveling (including a visit to the trial below) and finishing marking for the semester, which has somewhat coloured my perspective on dealing with large amounts of data. The thing is that dealing with hotel internet caps (100MB), or even with my lowly home connection’s monthly cap (5GB), does get one thinking… Would I spend several months’ worth of internet connection just downloading data so I could plot some regression lines for 110 data points? And does it really make sense to run a linear regression with two predictors using 100 million records?
My basic question is: why would I want to deal with all those 100 million records directly in R? Wouldn’t it make much more sense to reduce the data to a meaningful size using the original database, up there in the cloud, and download the reduced version to continue an in-depth analysis? There are packages to query external databases (ROracle, RMySQL, RODBC, …, pick your poison), we can sample to explore the dataset, and so on.
We can deal with a rather large dataset on our laptop, but is that the best way to tackle the underlying modeling problem? Just wondering.
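As a rough sketch of what I have in mind (the DSN, credentials, table and column names are all made up), the database can do the heavy aggregation and we only download the reduced result:

```r
# Sketch only: connection details, table and column names are hypothetical.
library(RODBC)

channel <- odbcConnect("sales_dsn", uid = "user", pwd = "password")

# Let the database reduce 100 million rows to a few hundred
# before anything travels over the (capped) connection.
reduced <- sqlQuery(channel,
  "SELECT region, AVG(price) AS mean_price, COUNT(*) AS n
     FROM transactions
    GROUP BY region")

odbcClose(channel)

# Continue the in-depth analysis on the small data frame.
head(reduced)
```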
Hello, nice post! I have not worked with 10 million records, but if you try anything close to a regression on half a million you quickly run out of tools in R and are left with no choice other than going back to Stata. And sometimes, like when the thing you're looking for is quite rare, it does not make sense to sample your 500,000 records. Just my 5 cents.
Hi Juta,
Every year I run genetic evaluations with ~ 400,000 records using linear mixed models much more complicated than a linear regression. I have hundreds of variance components, pedigrees, etc., using ASReml-R (the package is free for academic use). Where I am struggling is that many people are throwing brute force at "big data" problems while we are still stuck using relatively simple models; does one need 100 million records to get an understanding of a linear regression with two predictors?
If I had to analyze a dataset with what I assume is a large degree of complexity I would probably go for some interesting hierarchical models (the kind of stuff that you would see in, for example, Gelman and Hill's book).
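To make that a bit more concrete, here is a minimal sketch of the kind of varying-intercept model I am thinking of, using lme4 rather than ASReml-R (the data frame and variable names are hypothetical):

```r
# Sketch only: 'trials' is a hypothetical data frame with tree height,
# stem diameter and a grouping factor for trial site.
library(lme4)

# A fixed slope for diameter plus site-level intercepts,
# rather than one huge pooled regression.
fit <- lmer(height ~ diameter + (1 | site), data = trials)
summary(fit)
```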
My guess is that their end customers are big companies…
Cheers 🙂
Tal
Great post and great question, Luis. My take is yes, we do need to do big data in R. Even if the ultimate data set is small, it's very useful to be able to extract it with R, simply because the extraction process itself requires statistical thinking best done in R. Also, there are several situations where statistical analysis of large data sets as a whole is necessary (and, to Tal's point, many Revolution Analytics customers fall into this camp). Anyway, I elaborate on these points a bit more in this blog post: http://blog.revolutionanalytics.com/2011/11/why-w… which links to a couple of white papers that go into even more detail.
Dear David,
I do understand that some people have a lot of data and that some (many, most?) of them prefer to work in a single language/environment; I tend to be in that camp if I can get away with it. However:
1. It may not be the most efficient way to actually deal with that amount of data (particularly if it is not a one-off analysis and the data are growing and/or being updated all the time).
2. It could be that the examples did not sell me the story: if I had so much data I would go for a much better understanding of the situation (see my comment to Juta above).
I think that Revolutions has done a great job at helping R attain 'street cred' in the commercial world. At the end of the day we are analyzing data to gain understanding and—in the case of many companies—influence customers, which I think will require tools to bridge databases and statistical software. I do think that we should do as much pre-processing as possible upstream (next to the database) but I can see the need for efficient data structures downstream.
Thanks for stopping by.
Many people do genome-wide association studies where they have millions of genetic variants (data points) to find the one(s) correlated with a trait of interest. Similarly, I also deal on a daily basis with large DNA sequencing files (several GB). Of course there are other languages to parse such files (perl, python, etc.) to get more manageable output, which can then be loaded into R for statistical analyses. At the same time, if all this could be processed in R directly it would make life easier (and I would not have to learn other computing languages!).
Hi Sebastien,
Yes, we do deal with genomic selection, although I tend to use other software for that. I assume that one tends to implement the algorithms for GS pretty close to the metal, and I would go for specialist software in that case, unless I were working with toy examples. If you are lucky enough to have 50,000+ markers you will need all the efficiency you can get from highly optimized code (often in C or FORTRAN).
yes and no…
There are probably lots of cases where a large dataset is just lots more of the same, but I think that you would want to build a few models on the whole dataset (or as large a piece as you can handle) at least once, and get all the data into working R memory so you can bootstrap it to find the smallest sample you are happy with.
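Something like this, very roughly (the data frame and variable names are made up): refit the model on larger and larger subsamples and see when the coefficients stop moving.

```r
# Sketch only: 'big' is a hypothetical data frame already in memory,
# with response y and predictors x1 and x2.
set.seed(1)
sizes <- c(1e3, 1e4, 1e5, 1e6)

coef_by_size <- sapply(sizes, function(n) {
  idx <- sample(nrow(big), n)
  coef(lm(y ~ x1 + x2, data = big[idx, ]))
})

# Columns should stabilise once the sample is 'big enough'.
colnames(coef_by_size) <- sizes
round(coef_by_size, 4)
```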
Then there are lots of cases when you are either actually looking for outliers or needles in a haystack. The examples above of genetic variants and sequencing files are interesting because you need all the data to look for gene loci that are different, and there are interesting 'big data' solutions within the public Bioconductor packages (distinct from Revolution) that put genomic data outside R's own memory management (I guess like pointers… err, sort of).
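(A related sketch of the file-backed idea, using the CRAN package bigmemory rather than Bioconductor, just as an illustration; the dimensions and file names are made up.)

```r
# Sketch only: a file-backed matrix lives on disk, not in R's heap.
library(bigmemory)

x <- filebacked.big.matrix(nrow = 1e7, ncol = 3, type = "double",
                           backingfile = "genotypes.bin",
                           descriptorfile = "genotypes.desc")

# Subsets are pulled into memory only when requested.
x[1:5, ]
```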
Further, there are cases where sampling is inappropriate, or at least not straightforward. For instance, consider the topical story today of analysing network connectivity between Facebook users. If you wanted to repeat that on your laptop, could you sample a small part of the dataset and still work out the average connectivity of the whole dataset? (~4.6, I think it was.)
I totally agree that there are cases where we are "looking for outliers or needles in a haystack". My point is that we may want to look for them in the database rather than in R. Let's move from the small "big data" (a few GB) to larger "big data" (say several TB), or to frequently updated datasets, so we have to keep retrieving from an ever-growing, ever-changing database. I think it would be much more efficient to do the data processing, outlier hunting and needle finding in the database and then continue the analysis in R. Maybe what we need is a package that will translate R's syntax for dealing with data frames into SQL (or whatever one uses in a huge database), if the use case is that we don't want to deal with the additional language load.
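To sketch the idea (the connection and table are hypothetical, and dplyr/dbplyr are one way this can work): write the data-frame verbs once in R and let them run as SQL inside the database, bringing back only the reduced result.

```r
# Sketch only: 'transactions' is a hypothetical table in the database.
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), "remote_copy.sqlite")

reduced <- tbl(con, "transactions") %>%
  group_by(region) %>%
  summarise(mean_price = mean(price), n = n())

show_query(reduced)            # the SQL that runs in the database
local_df <- collect(reduced)   # only the reduced result comes down to R

dbDisconnect(con)
```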
Sampling is part of my "professional deformation" as a forester. If I want to build a model to, say, predict the volume of trees given their stem diameter, I will use a sample. I would not dream of assessing billions of trees in the forest; I will sample (probably using several strata that reflect my general understanding of the problem) and then start my modeling process. In the case of Facebook, they already have the data for hundreds of millions of people and I guess that they could get a pretty good estimate of the average connectivity by using many samples (at the same time getting an estimate of the standard error).
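As a small sketch of that workflow (the assessment data frame and its columns are hypothetical): take a fixed number of trees per stratum and fit the volume model to the sample only.

```r
# Sketch only: 'trees' is a hypothetical assessment data frame with
# columns volume, dbh (stem diameter) and stratum.
set.seed(2011)

# Up to 100 trees per stratum.
samp <- do.call(rbind, lapply(split(trees, trees$stratum), function(d)
  d[sample(nrow(d), size = min(100, nrow(d))), ]))

# Classic allometric volume model fitted to the sample, not the forest.
fit <- lm(log(volume) ~ log(dbh), data = samp)
summary(fit)
```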
I accept that there are many cases where one wants to deal with the whole dataset (I do that most of the time), but I insist that if we really want to talk about Big data (with a capital B), we will probably want to rely on the original database and, many times, on sampling.
Hi Luis,
Great gratuitous picture of radiata pine. I so wish we still did forestry here in the States.
Dear Norm,
We are certainly doing a lot of forestry in New Zealand and in parts of Australia (where I took the picture) and we are still establishing and analyzing quite a few trials (which I tend to cover in this blog). It is interesting to see how much of forestry has moved to either the tropics or the Southern hemisphere. Unfortunately, organizations like IUFRO seem to forget about the important role of production forests in people's lives.
I vaguely remember a talk by Brian Ripley at the ISI world congress in 2003 where he described using a combination of a single-PC R front end and a multi-PC back end, connected via a small-bandwidth internet connection, to process big data… but I don't know whether his system (or a similar one) ever reached production stage.
That sounds very cool. I would like to know how and what people are doing with very large datasets besides running classification algorithms.