Category: r (Page 12 of 20)

Harvesting Eucalyptus urophylla x E. grandis hybrid clones in Brazil (Photo: Luis).

Review: “Forest Analytics with R: an introduction”

2012-05-30 / Luis

Forestry is the province of variability. From a spatial point of view this variability ranges from within-tree variation (e.g. modeling wood properties) to billions of trees growing in millions of hectares (e.g. forest inventory). From a temporal point of view we can deal with daily variation in a physiological model to many decades in an empirical growth and yield model. Therefore, it is not surprising that there is a rich tradition of statistical applications to forestry problems.

At the same time, the scope of statistical problems is very diverse. As the saying goes forestry deals with “an ocean of knowledge, but only one centimeter deep”, which is perhaps an elegant way of saying a jack of all trades, master of none. Forest Analytics with R: an introduction by Andrew Robinson and Jeff Hamann (FAWR hereafter) attempts to provide a consistent overview of typical statistical techniques in forestry as they are implemented using the R statistical system.

Gratuitous picture: Firescapes II, night illuminated by bonfire (Photo: Luis).

R’s increasing popularity. Should we care?

2012-05-17 / Luis

Some people will say ‘you have to learn R if you want to get a job doing statistics/data science’. I say bullshit, you have to learn statistics and learn to work in a variety of languages if you want to be any good, beyond getting a job today coding in R.

R4stats has a recent post discussing the increasing popularity of R against other statistical software, using citation counts in Google Scholar. It is a flawed methodology, at least as flawed as other methodologies used to measure language popularities. Nevertheless, I think is hard to argue against the general trend: R is becoming more popular. There is a deluge of books looking at R from every angle, thousands of packages and many jobs openings asking for R experience, which prompts the following question:

Gratuitous picture: Trees at 8 pm illuminated by bonfire and full moon (Photo: Luis).

Bivariate linear mixed models using ASReml-R with multiple cores

2012-05-07 / Luis

A while ago I wanted to run a quantitative genetic analysis where the performance of genotypes in each site was considered as a different trait. If you think about it, with 70 sites and thousands of genotypes one is trying to fit a 70×70 additive genetic covariance matrix, which requires 70*69/2 = 2,415 covariance components. Besides requiring huge amounts of memory and being subject to all sort of estimation problems there were all sort of connectedness issues that precluded the use of Factor Analytic models to model the covariance matrix. The best next thing was to run over 2,000 bivariate analyses to build a large genetic correlation matrix (which has all sort of issues, I know). This meant leaving the computer running for over a week.
Continue reading

Gratuitous picture: looking for the right bicycle in Uppsala (Photo: Luis).

Teaching code, production code, benchmarks and new languages

2012-04-30 / Luis

I’m a bit obsessive with words. May be I should have used learning in the title, rather than teaching code. Or perhaps remembering code. You know? Code where one actually has very clear idea of what is going on; for example, let’s say that we are calculating the average of a bunch of n numbers, we can have a loop that will add up each of them and then divide the total by n. Of course we wouldn’t do that in R, but use a simple function: mean(x).

In a previous post I compared R and Julia code and one of the commenters (Andrés) rightly pointed out that the code was inefficient. It was possible to speed up the calculation many times (and he sent me the code to back it up), because we could reuse intermediate results, generate batches of random numbers, etc. However, if you have studied the genomic selection problem, the implementations in my post are a lot closer to the algorithm. It is easier to follow and to compare, but not too flash in the speed department; for the latter we’d move to production code, highly optimized but not very similar to the original explanation.

Late-April flotsam

2012-04-26 / Luis

It has been month and a half since I compiled a list of statistical/programming internet flotsam and jetsam.

Via Lambda The Ultimate: Evaluating the Design of the R Language: Objects and Functions For Data Analysis (PDF). A very detailed evaluation of the design and performance of R. HT: Christophe Lalanne. If you are in statistical genetics and Twitter Christophe is the man to follow.
Attributed to John Tukey, “without assumptions there can be no conclusions” is an extremely important point, which comes to mind when listening to the fascinating interview to Richard Burkhauser on the changes of income for the middle class in USA. Changes to the definition of the unit of analysis may give a completely different result. By the way, does someone have a first-hand reference to Tukey’s quote?
Nature news publishes RNA studies under fire: High-profile results challenged over statistical analysis of sequence data. I expect to see happening more often once researchers get used to upload the data and code for their papers.
Bob O’Hara writes on Why simple models are better, which is not positive towards the machine learning crowd.
A Matlab Programmer’s Take On Julia, and a Python developer interacts with Julia developers. Not everything is smooth. HT: Mike Croucher. ?
Dear NASA: No More Rainbow Color Scales, Please. HT: Mike Dickinson. Important: this applies to R graphs too.
Rafael Maia asks “are programmers trying on purpose to come up with names for their languages that make it hard to google for info?” These are the suggestions if one searches Google for Julia:

I suggest creating a language called Bieber and search for dimension Bieber, loop Bieber and regression Bieber.

That’s all folks.