A post on high-dimensional arrays by @isomorphisms reminded me of APL and, more generally, of matrix languages, which took me back to inquisitive computing: computing not in the sense of software engineering, or databases, or formats, but of learning by poking at problems with a computer.
I like languages not because I can get a job by using one, but because I can think thoughts and express ideas through them. The way we think about a problem is somehow molded by the tools we use: if we have loops, we use loops; if we have a terse matrix notation (see my previous post on Matrix Algebra Useful for Statistics), we may use that instead. Continue reading
I try to be economical when writing code; for example, I tend to use single quotes over double quotes for characters because it saves me one keystroke. One area where I don’t do that is when typing TRUE and FALSE (R accepts T and F as well), just because they are clearer to see in code and syntax highlighting kicks in. That’s why I was surprised to read in Jason Morgan’s post that it is possible to redefine T and F and get undesirable behavior. Continue reading
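To make the surprise concrete (this is just a minimal illustration, not Jason’s example): T and F are ordinary variables that start out bound to TRUE and FALSE, while TRUE and FALSE themselves are reserved words, so only the former can be silently overwritten.

# T and F can be reassigned because they are just variables provided by the
# base package; TRUE and FALSE are reserved words and cannot be.
T <- 0               # looks harmless, e.g. a leftover temporary variable
isTRUE(T)            # FALSE -- any later code relying on T now misbehaves
# TRUE <- 0          # this, in contrast, throws an assignment error
rm(T)                # removes the masking copy and restores base::T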
The email wasn’t a challenge but a simple question: Is it possible to run a multivariate analysis in multiple sites? I was going to answer ‘yes, of course’ and leave it there, but that would be a cruel, unsatisfying answer. We can get a better handle on the question if we use a simple example; let’s assume that we have two traits (call them tree stem diameter and stem density) assessed in two sites (localities).
Because this is genetics we have a family structure; let’s say half-siblings, so trees within a family have only the mother in common, and we will ignore any experimental design features to keep things simple. We have 100 families, with 30 trees each, in sites A and B, for a total of 6,000 trees (100 x 30 x 2). The data could look like this: Continue reading
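The actual dataset is in the full post; as a rough sketch of the structure just described, a simulation along these lines would do (the column names, means and variances are made up for illustration):

# Made-up numbers: only the structure (2 sites x 100 half-sib families x
# 30 trees each, with diameter and density measured on every tree)
# matches the description above.
set.seed(2012)
sim <- expand.grid(site = factor(c('A', 'B')),
                   family = factor(1:100),
                   tree = 1:30)
fam.diam <- rnorm(100, mean = 0, sd = 2)    # family effects for diameter
fam.dens <- rnorm(100, mean = 0, sd = 20)   # family effects for density
sim$diameter <- 25 + fam.diam[sim$family] + rnorm(nrow(sim), sd = 5)
sim$density  <- 450 + fam.dens[sim$family] + rnorm(nrow(sim), sd = 40)
dim(sim)    # 6000 rows, one per tree
head(sim)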
I have written a few posts discussing descriptive analyses of the evaluation of National Standards for New Zealand primary schools. The data for roughly half of the schools was made available by the media, but the full version of the dataset is provided on a single-school basis. In the page for a given school there may be a link to a PDF file with the information on standards sent by the school to the Ministry of Education. Continue reading
This week I’ve tried to i) stay mostly in the descriptive statistics realm and ii) surround any simple(istic) models with caveats, pointing out that they are very preliminary. We are working with a sample of ~1,000 schools that did reply to Fairfax’s request, while a number of schools either ignored the request or told Fairfax to go and F themselves. Why am I saying this? If one gets a simple table of the number of schools by type and decile there is something quite interesting: we have different percentages for different types of schools represented in the sample, and the possibility of bias in the reporting to Fairfax due to potential low performance (references to datasets correspond to the ones I used in this post):
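This is the kind of tabulation I mean; a sketch using the standards data frame from my previous posts (the same one used further down), with the output omitted here:

# Number of schools by type in the Fairfax sample, plus a breakdown
# of type by decile.
summary(factor(standards$school.type))
table(standards$school.type, standards$decile)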
Now let’s compare these numbers with the school directory:
summary(factor(directory$school.type))
# Composite (Year 1-10) Composite (Year 1-15) Contributing (Year 1-6)
#                     4                   149                     775
# Correspondence School Full Primary (Year 1-8) Intermediate (year 7 and 8)
#                     1                    1101                         122
# Restricted Composite (Yr 7-10) Secondary (Year 11-15) Secondary (Year 7-10)
#                              4                      2                     2
# Secondary (Year 7-15) Secondary (Year 9-15) Special School
#                   100                   238             39
# Teen Parent Unit
#               20
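Raw counts are hard to compare directly because the sample is much smaller than the directory, so here is a quick sketch of the same comparison as proportions (assuming both data frames label school types in the same way):

# Share of each school type in the Fairfax sample versus the full
# directory; some types are represented quite differently in the two.
round(prop.table(table(standards$school.type)), 3)
round(prop.table(table(directory$school.type)), 3)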
As a proportion, we are missing more secondary schools. We can use the following code to get an idea of how similar the school types are, because the small number of schools in each composite category is a pain. If some types look essentially the same, we can merge them.
library(ggplot2)  # qplot() comes from ggplot2

# Performance of Contributing (Year 1-6) and
# Full Primary (Year 1-8) looks pretty much the
# same. Composites could be safely merged
qplot(school.type, reading.OK,
      data = standards, geom = 'jitter')
qplot(school.type, writing.OK,
      data = standards, geom = 'jitter')
qplot(school.type, math.OK,
      data = standards, geom = 'jitter')
# Merging school types and plotting them colored
# by decile
standards$school.type.4 <- standards$school.type
levels(standards$school.type.4) <- c('Composite', 'Composite', 'Primary',
                                     'Primary', 'Intermediate',
                                     'Composite', 'Secondary')
qplot(school.type.4, reading.OK, colour = decile,
      data = standards, geom = 'jitter')
I’m using jittering rather than box and whisker plots to i) depict all the schools and ii) get an idea of the differing representation of school types in the dataset. Sigh. Another caveat to add to the discussion.
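For comparison, the box-and-whisker version is just a change of geom (shown only to illustrate the choice):

# Boxplots are tidier, but they hide how many schools of each type are
# actually in the sample.
qplot(school.type.4, reading.OK,
      data = standards, geom = 'boxplot')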
P.S. 2012-09-27 16:15. Originally I mentioned in this post the lack of secondary schools (Year 9-15), but they are not supposed to be here: National Standards apply to Years 1 to 8. (Thanks to Michael MacAskill for pointing out my error.)