Over on the birdsite dumpster fire, Emily Harvey was asking:

do you know of any good guidelines/advice for what one should do to sense check and make sure they understand any data before using it?

I replied the following:

Typically, I am either very familiar with the type of data and its variables (if it is one of my trials) or I chat/email multiple times with the owner of the dataset(s), so I can check that:

  • units and recorded values match. If units are mm, for example, the magnitudes should make sense in mm.
  • the order of assessments and experimental/sampling design match: people often get lost in trials or when doing data collection, recording the wrong sampling unit codes.
  • dates are OK. I prefer the ISO 8601 format (2023-04-07); anyway, this is often a problem when dealing with Excel data.
  • environmental data, if we are using it, matches my expectations about the site. I have found a few weather station problems doing that, where rainfall was too low (because there was a sensor failure).
  • the relationships between variables are OK. Examples of problems: tall and too skinny trees, or fat and short ones (suspicious unless broken, etc.), diameter under bark that is not smaller than diameter over bark, etc.
  • levels of factors match the planned levels (typically spelling mistakes create extra levels). Same issue with locality names.
  • map coverage/orientation is OK (sometimes maps are sideways). Am I using the right projection?
  • joins retain the appropriate number of rows (I mean table joins using merge or left_join in R, etc).
  • Missing values! Are they coded correctly as NA, or with zeros or negative numbers? Are they “random”?
  • If longitudinal data: are later measurements of the same individual larger than earlier ones (or do we get shrinking trees)?
  • etc
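
A few of these checks are easy to automate. Below is a minimal sketch in plain Python (standard library only) of three of them: unexpected factor levels, suspicious missing-value codes and row counts after a join. All data, site names, column names and helper functions are hypothetical, made up for illustration; in R I would do the equivalent with table(), summary() and nrow() before and after left_join.

```python
from collections import Counter

# Hypothetical trial records; every name and value is made up for illustration.
trees = [
    {"tree_id": "T1", "site": "Hilltop", "height_m": 12.3},
    {"tree_id": "T2", "site": "Hiltop", "height_m": 0.0},    # misspelt site, zero height
    {"tree_id": "T3", "site": "Valley", "height_m": -99.0},  # negative code for missing
    {"tree_id": "T4", "site": "Valley", "height_m": 11.8},
]
planned_sites = {"Hilltop", "Valley"}

# Hypothetical site table with an accidentally duplicated key.
site_info = [
    {"site": "Hilltop", "rainfall_mm": 650},
    {"site": "Valley", "rainfall_mm": 620},
    {"site": "Valley", "rainfall_mm": 625},
]


def unexpected_levels(rows, column, planned):
    """Return observed levels that are not in the planned set, with counts."""
    counts = Counter(row[column] for row in rows)
    return {level: n for level, n in counts.items() if level not in planned}


def suspicious_missing(rows, column):
    """Return rows that are None, zero or negative in a column that should be positive."""
    return [row for row in rows if row[column] is None or row[column] <= 0]


def left_join(left, right, key):
    """Naive left join on a single key; unmatched left rows are kept."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], [None]):
            joined.append({**row, **(match or {})})
    return joined


extra_levels = unexpected_levels(trees, "site", planned_sites)
odd_heights = suspicious_missing(trees, "height_m")
joined = left_join(trees, site_info, "site")

print("Unplanned site levels:", extra_levels)
print("Trees with suspicious heights:", [row["tree_id"] for row in odd_heights])
print("Rows before join:", len(trees), "after join:", len(joined))
```

In this made-up example the join returns more rows than the original table because one site appears twice in the lookup table, which is exactly the kind of thing that comparing row counts before and after should catch.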

Of course these questions are dataset dependent and need to be adapted to each separate situation. Finally: Do results make any sense?
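
As an illustration of adapting the checks to one particular type of dataset, here is a rough sketch (plain Python again) of the tree-specific ones: bark diameters, odd height/diameter combinations and shrinking trees. The measurements are invented, and the slenderness limits (30 and 150) are arbitrary assumptions that would need to be set for the species and age at hand.

```python
# Hypothetical repeated measurements, invented for illustration:
# (tree_id, year, height_m, diameter_over_bark_cm, diameter_under_bark_cm)
measurements = [
    ("T1", 2020, 10.0, 15.0, 13.5),
    ("T1", 2022, 12.0, 18.0, 16.4),
    ("T2", 2020, 11.0, 16.0, 14.2),
    ("T2", 2022, 10.2, 17.5, 18.0),  # shorter than in 2020, under bark > over bark
    ("T3", 2022, 30.0, 9.0, 8.1),    # very tall for its diameter
]


def bark_problems(rows):
    """Flag records where diameter under bark is not smaller than over bark."""
    return [(tree, year) for tree, year, _, over, under in rows if under >= over]


def odd_shape(rows, low=30.0, high=150.0):
    """Flag records whose height/diameter ratio (both in m) is outside assumed limits."""
    flagged = []
    for tree, year, height, over, _ in rows:
        ratio = height / (over / 100.0)  # convert cm to m before dividing
        if ratio < low or ratio > high:
            flagged.append((tree, year))
    return flagged


def shrinking(rows, value_index):
    """Flag trees whose value decreased between consecutive assessments."""
    last = {}
    flagged = []
    for row in sorted(rows, key=lambda r: (r[0], r[1])):
        tree, year, value = row[0], row[1], row[value_index]
        if tree in last and value < last[tree][1]:
            flagged.append((tree, last[tree][0], year))
        last[tree] = (year, value)
    return flagged


bark_flags = bark_problems(measurements)
shape_flags = odd_shape(measurements)
height_flags = shrinking(measurements, value_index=2)

print("Bark diameter problems:", bark_flags)
print("Odd height/diameter ratio:", shape_flags)
print("Shrinking heights:", height_flags)
```

A flagged record is not necessarily wrong (a broken top explains a shorter tree), so I would treat the output as a list of things to ask the data owner about rather than something to delete automatically.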