
Linear regression with correlated data

2011-10-06 / Luis

I started following the debate on a differential minimum wage for youth (15-19 years old) and adults in New Zealand. Eric Crampton has written a nice series of blog posts, making the data from Statistics New Zealand available. I will use the nzunemployment.csv data file (with quarterly data from March 1986 to June 2011) and show an example of multiple linear regression with autocorrelated residuals in R.

A first approach could be to ignore autocorrelation and fit a linear model that attempts to predict youth unemployment with two explanatory variables: adult unemployment (continuous) and minimum wage rules (categorical: equal or different). This can be done using:

R pitfall #1: check data structure

2011-10-05 / Luis

A common problem when running a simple (or not so simple) analysis is forgetting that the levels of a factor have been coded using integers. R doesn’t know that this variable is supposed to be a factor and, when fitting something as simple as a one-way ANOVA (using lm()), it will use the variable as a covariate rather than as a factor.
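The post itself works in R; to stay with the language of the code elsewhere on this page, here is a rough Python sketch of why the distinction matters. The data are made up for illustration, and the function names rss_as_covariate and rss_as_factor are hypothetical. Treating the integer codes as a number fits a single slope, while treating them as a factor fits one mean per level.

```python
from statistics import mean

# Made-up illustrative data: levels coded 1, 2, 3 are labels, not quantities
treatment = [1, 1, 1, 2, 2, 2, 3, 3, 3]
response = [10.0, 11.0, 12.0, 20.0, 21.0, 22.0, 12.0, 13.0, 14.0]


def rss_as_covariate(x, y):
    """Residual sum of squares when x is treated as a number (one slope)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def rss_as_factor(x, y):
    """Residual sum of squares when x is treated as a factor (one mean per level)."""
    level_means = {
        level: mean(b for a, b in zip(x, y) if a == level) for level in set(x)
    }
    return sum((b - level_means[a]) ** 2 for a, b in zip(x, y))


print("RSS, codes used as covariate:", rss_as_covariate(treatment, response))
print("RSS, codes used as factor:", rss_as_factor(treatment, response))
```

In this toy data set the middle level has the highest response, so a straight line through the codes fits poorly, while the per-level means fit well. That mismatch is the symptom to look for when a factor has been read as a number.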

There is a series of steps that I follow to make sure that I am using the right variables (and types) when running a series of analyses. I always define the working directory (using setwd()), so I know where the files that I am reading from and writing to are.

A shoebox for data analysis

2011-10-04 / Luis

Recidivism. That’s my situation concerning this posting flotsam in/on/to the ether. I’ve tried before and, often, will change priorities after a few posts; I rationalize this process thinking that I’m cured and have moved on to real life.

This time may not be different but, simultaneously, it will be more focused: just a shoebox for the little pieces of code or ideas that I use when running analyses.

Derelict house in Sintra, Portugal (Photo: Luis).

Python code to simulate the Monty Hall problem

2011-04-24 / Luis

I always struggled to understand the Monty Hall Problem, which popped up again in William Briggs’s site. The program assumes that initially the prize is equally likely to be behind any one of the three doors. Once the contestant (player in the program) chooses a door, the game host (who knows where the prize is) opens another door that does not contain the prize. The host then offers the contestant the choice between keeping the original choice or switching to the unopened door.

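Initially the contestant has 1/3 probability of choosing the door with the prize and 2/3 of choosing a wrong door. If the contestant has chosen the right door, switching means losing. However, 2/3 of the time the contestant has chosen a wrong door; the host must then open the other wrong door, so the remaining unopened door hides the prize and Monty Hall is in fact offering the prize.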
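The same argument can be checked by exact enumeration rather than simulation. This is a minimal sketch, assuming (as stated above) that the prize and the first choice are each uniform over the three doors and independent; the variable names are mine.

```python
from fractions import Fraction
from itertools import product

doors = (1, 2, 3)
stay = Fraction(0)    # probability of winning by keeping the first choice
switch = Fraction(0)  # probability of winning by switching

# Each (prize, first choice) pair has probability 1/3 * 1/3 = 1/9
for prize, choice in product(doors, repeat=2):
    weight = Fraction(1, 9)
    if prize == choice:
        # First pick was right: staying wins, switching loses
        stay += weight
    else:
        # First pick was wrong: the host opens the other empty door,
        # so the remaining unopened door hides the prize
        switch += weight

print("Stay:", stay)
print("Switch:", switch)
```

Three of the nine equally likely pairs have the prize behind the chosen door, which gives 1/3 for staying; the other six give 2/3 for switching.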

from __future__ import division, print_function
from random import randint

# Number of games and game generation
ngames = 10000

# Door hiding the prize and door first chosen by the player, one per game
prize = [randint(1, 3) for _ in range(ngames)]
player = [randint(1, 3) for _ in range(ngames)]

# Evaluating success
default = 0  # wins when keeping the original door
monty = 0    # wins when switching doors

for game in range(ngames):
    if prize[game] == player[game]:
        # Keeping the original door wins; switching means losing
        default = default + 1
    else:
        # The host opens the other empty door, so switching means winning
        monty = monty + 1

print('Probability of winning (own door)', default/ngames)
print('Probability of winning (switching doors)', monty/ngames)

The final probabilities should be pretty close to 1/3 and 2/3, and they get closer as the number of games (ngames) used in the simulation increases.
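The program above never opens a door: it relies on the fact that switching wins exactly when the first choice was wrong. For readers who want to see the host's move spelled out, here is a sketch of a variant that models it. The functions play and win_rate and the seed parameter are my own additions, not part of the original program.

```python
import random


def play(switch, rng):
    """Play one game; return True if the contestant ends up with the prize."""
    doors = [1, 2, 3]
    prize = rng.choice(doors)
    first_pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize
    opened = rng.choice([d for d in doors if d != first_pick and d != prize])
    if switch:
        # Move to the only door that is neither picked nor opened
        final_pick = next(d for d in doors if d not in (first_pick, opened))
    else:
        final_pick = first_pick
    return final_pick == prize


def win_rate(switch, ngames=10000, seed=None):
    """Fraction of games won over ngames independent plays."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(ngames)) / ngames


print("Probability of winning (own door)", win_rate(switch=False))
print("Probability of winning (switching doors)", win_rate(switch=True))
```

Passing a fixed seed makes a run repeatable; leaving it as None gives a different sample each time.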