Evolving notes, images and sounds by Luis Apiolaza

# Category: stats (Page 1 of 8)

A few posts ago I was talking about heritabilities (like here) and it’s time to say something about genetic correlations. This is how I explain correlations to myself or in meetings with colleagues. Nothing formal, mostly an analogy.

Say we draw the distribution of breeding values for one trait (X) and, rather than looking at it from the side, we look at it from the top. It looks like a straight line, where the length gives an idea of the variability and a cross marks the mean. We can have another distribution (Y), perhaps not as long (so less variable) or maybe longer.

Often variables vary together (co-vary, vary at the same time), and we can show that by drawing the lines at an angle, crossing at their means. If you look at the formula for the covariance (co-variance, because traits co-vary, get it?): we take the deviation from the mean of each trait for each observation, multiply them, add them all up and take the average. The product is positive when both traits are above or below their means; it is negative when one trait is below its mean and the other above it. Covariances are a pain, as they can take any value. Instead we can use “standardised” covariances that vary between -1 and 1: we call these things *correlations*.
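That recipe translates almost word for word into R. A minimal sketch, using made-up breeding values (the numbers are hypothetical, just for illustration):

```r
# Hypothetical breeding values for two traits
x <- c(2.1, -0.5, 1.3, -1.8, 0.4, 2.6, -0.9, 0.1)
y <- c(1.5, -0.2, 0.9, -1.1, 0.6, 1.8, -1.2, 0.3)

# Covariance: average product of the deviations from the means
# (dividing by n - 1, as R's cov() does)
covar <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

# "Standardised" covariance: divide by both standard deviations
correl <- covar / (sd(x) * sd(y))

covar - cov(x, y)   # essentially 0: matches the built-in
correl - cor(x, y)  # essentially 0, and bounded between -1 and 1
```

Whatever the units of the traits, `correl` stays between -1 and 1, which is why correlations are so much easier to compare across traits than raw covariances.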

If the angle between the distributions is less than 90 degrees, increasing the values of one trait is (on average) accompanied by an increase in the other trait. Then we have a positive covariance and, therefore, a positive correlation. The smaller the angle, the closer the correlation is to 1.

If the angle is 90 degrees (or close to it), changing the value of one trait has no (or very little) effect on the other trait. Zero correlation.

If the angle is greater than 90 degrees, changing the value of one trait tends to reduce the values of the other trait. The closer the angle is to 180 degrees (so the positive values of one distribution line up with the negative values of the other), the closer the correlation is to -1.
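The angle picture is more than an analogy: for centred variables, the correlation *is* the cosine of the angle between them. A quick check in R (again with hypothetical numbers):

```r
x <- c(2.1, -0.5, 1.3, -1.8, 0.4)
y <- c(1.5, -0.2, 0.9, -1.1, 0.6)

# Centre the variables (deviations from their means)
xc <- x - mean(x)
yc <- y - mean(y)

# Cosine of the angle between the centred vectors
cos_angle <- sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))
angle_deg <- acos(cos_angle) * 180 / pi

cos_angle - cor(x, y)  # essentially 0: cosine of the angle = correlation
```

An angle below 90 degrees gives a positive cosine (positive correlation), exactly 90 degrees gives zero, and angles approaching 180 degrees push the cosine towards -1.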

Why do we care about these correlations? We use them all over the place in breeding. Sometimes as a measure of trade-off, as in “if I increase X, what will happen to Y?” or correlated response to selection. We also use them to understand how much information about one trait is contained in another trait, as in “can I use X as a selection criterion for Y?”. And a bunch of other uses, as well. But that’s another post.

This AIC looks way more fun than the other AIC for (soft toy) model selection.

There is no logical warrant for considering an event known to occur in a given hypothesis, even if infrequently, as disproving the hypothesis.

Joseph Berkson in “Tests of significance considered as evidence”. Journal of the American Statistical Association 37: 325-335.

Over at the birdsite dumpster fire, Emily Harvey was asking:

do you know of any good guidelines/advice for what one should do to sense check and make sure they understand any data before using it?

I replied the following:

Typically, I might be very familiar with the type of data and its variables (if it is one of my trials), or chat/email multiple times with the owner of the dataset(s), so I can check that:

• units and recorded values match. If units are mm, for example, the magnitudes should make sense in mm.
• the order of assessments and experimental/sampling design match: people often get lost in trials or when doing data collection, recording the wrong sampling unit codes.
• dates are OK. I prefer the ISO format (2023-04-07); in any case, dates are often a problem when dealing with Excel data.
• if we are using environmental data, that it matches my expectations about the site. I have found a few weather station problems doing that, where rainfall was too low because of a sensor failure.
• the relationships between variables are OK. Examples of problems: tall and too-skinny trees, short and fat ones (suspicious unless broken, etc.), diameter under bark larger than diameter over bark, etc.
• levels of factors match the planned levels (typically spelling mistakes create extra levels). Same issue with locality names.
• map coverage/orientation is OK (sometimes maps are sideways). Am I using the right projection?
• joins retain the appropriate number of rows (I mean table joins using merge or left_join in R, etc).
• Missing values! Are they coded as NA, or with zeros or negative numbers? Are they “random”?
• If longitudinal data: are older observations larger (or do we get shrinking trees)?
• etc

Of course these questions are dataset-dependent and need to be adapted to each separate situation. Finally: do the results make any sense?
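Several of the checks above translate directly into R one-liners. A sketch with a hypothetical trial dataset (column names, sites and the deliberate problems are all made up for illustration):

```r
# Hypothetical trial data with a couple of planted problems
trees <- data.frame(
  tree = 1:6,
  site = c("Golden Downs", "Golden Downs", "golden downs",  # typo creates a bogus level
           "Kaingaroa", "Kaingaroa", "Kaingaroa"),
  dbh_mm   = c(152, 148, 160, 0, 155, NA),  # 0 is likely a mis-coded missing value
  height_m = c(12.1, 11.8, 12.5, 11.9, 12.2, 12.0)
)

# Do factor levels match the planned levels?
table(trees$site)             # "golden downs" shows up as an extra level

# Do units and magnitudes make sense? (a dbh of 0 mm is suspicious)
range(trees$dbh_mm, na.rm = TRUE)

# Are missing values coded correctly?
sum(is.na(trees$dbh_mm))

# Do joins retain the appropriate number of rows?
sites <- data.frame(site = c("Golden Downs", "Kaingaroa"),
                    rainfall_mm = c(970, 1450))
joined <- merge(trees, sites, all.x = TRUE)
nrow(joined) == nrow(trees)   # a many-to-many join would inflate the row count
```

Note how the misspelt site silently fails to match in the join (its `rainfall_mm` comes back as NA), which is exactly the kind of quiet problem these checks are meant to catch.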

Null hypotheses of no difference are usually known to be false before the data are collected … when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science

Savage 1957 cited by Nelder 1999 “From Statistics to Statistical Science”. The Statistician 48(2): 257-269.