What Would Akaike Do?

This AIC looks way more fun than the other AIC for (soft toy) model selection.

Photo: AIC Fun Box in Terminal Los Héroes, Santiago, Chile.

Infrequent doesn’t disprove

There is no logical warrant for considering an event known to occur in a given hypothesis, even if infrequently, as disproving the hypothesis.

Joseph Berkson (1942) in “Tests of significance considered as evidence”. Journal of the American Statistical Association 37: 325-335.

Sense-checking data

Over on the birdsite dumpster fire, Emily Harvey was asking:

do you know of any good guidelines/advice for what one should do to sense check and make sure they understand any data before using it?

I replied the following:

Typically, I am either very familiar with the type of data and its variables (if it is one of my trials) or I chat/email multiple times with the owner of the dataset(s), so I can check that:

  • units and recorded values match. If units are mm, for example, the magnitudes should make sense in mm.
  • the order of assessments and experimental/sampling design match: people often get lost in trials or when doing data collection, recording the wrong sampling unit codes.
  • dates are OK. I prefer the ISO format (2023-04-07); in any case, dates are often a problem when dealing with Excel data.
  • if we are using environmental data, it matches my expectations about the site. I have found a few weather station problems doing that, where rainfall was too low (because there was a sensor failure).
  • the relationships between variables are OK. Examples of problems: trees that are tall and too skinny, or fat and short (suspicious unless broken, etc), diameter under bark larger than diameter over bark, etc.
  • levels of factors match the planned levels (typically there are spelling mistakes, so there are more levels than planned). Same issue with locality names.
  • map coverage/orientation is OK (sometimes maps are sideways). Am I using the right projection?
  • joins retain the appropriate number of rows (I mean table joins using merge or left_join in R, etc).
  • Missing values! Are NA coded correctly or with zeros, negative numbers? Are they “random”?
  • If longitudinal data: are later observations larger (or do we get shrinking trees)?
  • etc

Of course these questions are dataset dependent and need to be adapted to each separate situation. Finally: Do results make any sense?
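A few of the checks above are mechanical enough to script. What follows is only a minimal sketch, written in plain Python rather than R so that it runs without extra packages. The tree records, site names, elevations, the -99 missing-value code and all the function names are made up for illustration; a real dataset would need its own plausible ranges and planned levels.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (tree_id, site, assessment date, height in m).
records = [
    ("T1", "Ashley", "2022-04-07", 5.2),
    ("T1", "Ashley", "2023-04-07", 6.1),
    ("T2", "Ashely", "2022-04-07", 4.8),     # misspelled site name
    ("T2", "Ashley", "2023-04-07", 4.1),     # tree got shorter
    ("T3", "Burnham", "07/04/2023", -99.0),  # non-ISO date, sentinel value
]
planned_sites = {"Ashley", "Burnham"}
# Hypothetical lookup table (site, elevation in m) with a duplicated key.
site_info = [("Ashley", 120), ("Burnham", 95), ("Burnham", 96)]


def unexpected_levels(values, planned):
    """Return levels that are not in the planned set, with their counts."""
    return {v: n for v, n in Counter(values).items() if v not in planned}


def bad_iso_dates(values):
    """Return the values that cannot be parsed as ISO 8601 dates."""
    bad = []
    for v in values:
        try:
            date.fromisoformat(v)
        except ValueError:
            bad.append(v)
    return bad


def out_of_range(values, low, high):
    """Return values outside a plausible range (catches codes like -99)."""
    return [v for v in values if not (low <= v <= high)]


def shrinking_trees(rows):
    """Return ids of trees whose height drops between consecutive assessments.

    Sorting by the date string assumes ISO dates, so fix the dates first.
    """
    by_tree = {}
    for tree_id, _site, _when, height in sorted(rows, key=lambda r: (r[0], r[2])):
        by_tree.setdefault(tree_id, []).append(height)
    return [t for t, hs in by_tree.items()
            if any(later < earlier for earlier, later in zip(hs, hs[1:]))]


def left_join(left, right, left_key):
    """Left join rows to (key, value) pairs; duplicated keys duplicate rows."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    joined = []
    for row in left:
        for value in lookup.get(row[left_key], [None]):
            joined.append(row + (value,))
    return joined


joined = left_join(records, site_info, left_key=1)
print("Unexpected sites:", unexpected_levels([r[1] for r in records], planned_sites))
print("Bad dates:", bad_iso_dates([r[2] for r in records]))
print("Implausible heights:", out_of_range([r[3] for r in records], 0.0, 60.0))
print("Shrinking trees:", shrinking_trees(records))
print("Rows before and after join:", len(records), len(joined))
```

The join check is the one I would never skip: a left join should return as many rows as the left table, and a different count usually means duplicated or misspelled keys in one of the tables.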

Not a contribution to science

Null hypotheses of no difference are usually known to be false before the data are collected … when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science

Savage 1957 cited by Nelder 1999 “From Statistics to Statistical Science”. The Statistician 48(2): 257-269.
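A small illustration of the point about sample size, under assumptions that are mine and not Savage's or Nelder's: a two-sided, two-sample z-test with known unit standard deviation, equal group sizes and a tiny true difference of 0.01 standard deviations. The function name `power_two_sample_z` is hypothetical.

```python
import math
from statistics import NormalDist


def power_two_sample_z(delta, n, alpha=0.05):
    """Probability of rejecting 'no difference' in a two-sided two-sample z-test.

    delta: true difference between means, in standard deviation units.
    n: number of observations per group (assumed equal, sd assumed known).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the true difference.
    shift = delta / math.sqrt(2 / n)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Same tiny true difference, increasingly large samples.
for n in (10, 1_000, 100_000, 1_000_000):
    print(n, round(power_two_sample_z(0.01, n), 3))
```

With the difference held fixed, the probability of rejection starts near the significance level for small samples and climbs towards one as the sample grows, so the test result mostly tells us how much data we collected.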
