Category: programming (Page 5 of 6)

Overlay of design matrices in genetic analysis

2012-10-12 / Luis

I’ve ignored my quantitative geneticist side of things for a while (at least in this blog) so this time I’ll cover some code I was exchanging with a couple of colleagues who work for other organizations.

It is common to use diallel mating designs in plant and tree breeding, where a small number of parents act as both males and females. For example, with 5 parents we can have 25 crosses, including reciprocals and selfing (crossing an individual with itself). Decades ago this mating design was tricky to fit and, considering an experimental layout with randomized complete blocks, one would have something like y = mu + blocks + dads + mums + cross + error. In this model dads and mums were each estimating a fraction of the additive genetic variance. With the advent of animal model BLUP, it became possible to fit something like y = mu + blocks + individual (using a pedigree) + cross + error. Another, less computationally demanding alternative (at least with unrelated parents) is to fit a parental model, overlaying the design matrices for parents with something like y = mu + blocks + (dad + mum) + cross + error.
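To make the overlay idea concrete, here is a minimal sketch in base R. The toy data (5 parents called P1 to P5, one record per cross) and all object names are made up for illustration; this is not the code exchanged with colleagues, just one way to build the (dad + mum) design matrix.

```r
# Hypothetical toy diallel: 5 parents, all 25 crosses (reciprocals and selfs)
parents <- paste0("P", 1:5)
crosses <- expand.grid(dad = parents, mum = parents, stringsAsFactors = FALSE)

# Both factors share the same levels so that the columns line up
crosses$dad <- factor(crosses$dad, levels = parents)
crosses$mum <- factor(crosses$mum, levels = parents)

# One indicator column per parent, no intercept
Z_dad <- model.matrix(~ dad - 1, data = crosses)
Z_mum <- model.matrix(~ mum - 1, data = crosses)

# Overlay: a single parental design matrix, the sum of both incidence matrices
Z_parent <- Z_dad + Z_mum
colnames(Z_parent) <- parents
```

Every row of Z_parent adds up to 2: an outcross has a 1 under each of its two parents, while a self has a 2 under its single parent.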

Gratuitous picture: a simple summer lunch

Scraping pages and downloading files using R

2012-10-02 / Luis

I have written a few posts discussing descriptive analyses of the evaluation of National Standards for New Zealand primary schools. The data for roughly half of the schools was made available by the media, but the full version of the dataset is provided on a school-by-school basis. In the page for a given school there may be a link to a PDF file with the information on standards sent by the school to the Ministry of Education.
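As a rough sketch of the link-finding step, assuming we already have the page source as text: the HTML snippet, the file name and base_url below are all invented placeholders, not the real site, and a proper HTML parser would be more robust than a regular expression.

```r
# Hypothetical page source; a real page could be read with readLines()
html <- c('<p>School report</p>',
          '<a href="/files/school-1234-standards.pdf">National Standards</a>',
          '<a href="/contact.html">Contact</a>')

# Pull out every href value, then keep the ones ending in .pdf
hrefs <- unlist(regmatches(html, gregexpr('href="[^"]+"', html)))
links <- sub('^href="', '', sub('"$', '', hrefs))
pdf_links <- links[grepl("\\.pdf$", links, ignore.case = TRUE)]

# Downloading would then be one call per link (not run; base_url is a placeholder)
# download.file(paste0(base_url, pdf_links[1]),
#               destfile = basename(pdf_links[1]), mode = "wb")
```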

Gratuitous picture: horse race in the beach (Photo: Luis).

m x n matrix with randomly assigned 0/1

2012-08-29 / Luis

Today Scott Chamberlain tweeted asking for a better/faster solution to building an m x n matrix with randomly assigned 0/1. He already had a working version:

r <- 1000
c <- 1000
m0 <- matrix(0, r, c)
apply(m0, c(1,2), function(x) sample(c(0,1),1))
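The version above calls sample() once per cell, that is, a million times. A couple of vectorised alternatives, offered as a sketch rather than as the definitive answer, draw all the values in one call and then shape them into a matrix. The names nr and nc are mine, chosen to avoid masking the c() function.

```r
nr <- 1000
nc <- 1000

# Draw all nr * nc values at once, then fold them into a matrix
m1 <- matrix(sample(0:1, nr * nc, replace = TRUE), nrow = nr, ncol = nc)

# Same idea using Bernoulli draws with probability 0.5
m2 <- matrix(rbinom(nr * nc, size = 1, prob = 0.5), nrow = nr, ncol = nc)
```

Wrapping each alternative in system.time() is a quick way to compare them on your own machine.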

Gratuitous picture: Firescapes II, night illuminated by bonfire (Photo: Luis).

R’s increasing popularity. Should we care?

2012-05-17 / Luis

Some people will say ‘you have to learn R if you want to get a job doing statistics/data science’. I say bullshit, you have to learn statistics and learn to work in a variety of languages if you want to be any good, beyond getting a job today coding in R.

R4stats has a recent post discussing the increasing popularity of R against other statistical software, using citation counts in Google Scholar. It is a flawed methodology, at least as flawed as other methodologies used to measure language popularity. Nevertheless, I think it is hard to argue against the general trend: R is becoming more popular. There is a deluge of books looking at R from every angle, thousands of packages and many job openings asking for R experience, which prompts the following question:

Gratuitous picture: looking for the right bicycle in Uppsala (Photo: Luis).

Teaching code, production code, benchmarks and new languages

2012-04-30 / Luis

I’m a bit obsessive with words. Maybe I should have used learning in the title, rather than teaching code. Or perhaps remembering code. You know? Code where one actually has a very clear idea of what is going on; for example, let’s say that we are calculating the average of a bunch of n numbers: we can have a loop that adds up each of them and then divides the total by n. Of course we wouldn’t do that in R, but use a simple function: mean(x).
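The two styles side by side, as a small illustration with made-up numbers:

```r
x <- c(2, 4, 6, 8)

# 'Teaching' version: add up each number, then divide the total by n
total <- 0
for (value in x) {
  total <- total + value
}
loop_mean <- total / length(x)

# What we would actually write in R
builtin_mean <- mean(x)
```

Both give the same answer; the loop just makes every step of the calculation visible.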

In a previous post I compared R and Julia code and one of the commenters (Andrés) rightly pointed out that the code was inefficient. It was possible to speed up the calculation many times (and he sent me the code to back it up), because we could reuse intermediate results, generate batches of random numbers, etc. However, if you have studied the genomic selection problem, the implementations in my post are a lot closer to the algorithm. It is easier to follow and to compare, but not too flash in the speed department; for the latter we’d move to production code, highly optimized but not very similar to the original explanation.