Poisson Regression, Making it Count

Introduction

Poisson regression is a modeling technique used to predict counts. Here we define a count as the number of occurrences of an event in a given period of time; an equivalent way of looking at counts is to look at rates. In statistics, we frequently use the Poisson distribution when working with counts of events, and that concept extends naturally to regression. This notebook builds a regression model for the number of bicyclists that cross a bridge in New York, based on predictor variables that include the day of the week, the daily high and low temperatures, and the presence of precipitation.

The data set

The data for this model started with a Kaggle bicycle challenge data set. Unfortunately, the data listed in that Kaggle challenge is not valid: in the Kaggle version, the month of April is repeated several times and the other months are missing. The data has been corrected for this notebook using the original source at , and a comment has been posted to the Kaggle site with the corrected source data set.

We will begin by doing some initial data exploration on this corrected data set.

bikes_raw <- read.csv('NYC_Bicycle_Counts_2016_Corrected.csv', header=T, stringsAsFactors=F)
summary(bikes_raw)
##      Date                DOY            Day               HighTemp    
##  Length:214         Min.   : 92.0   Length:214         Min.   :39.90  
##  Class :character   1st Qu.:145.2   Class :character   1st Qu.:66.05  
##  Mode  :character   Median :198.5   Mode  :character   Median :78.10  
##                     Mean   :198.5                      Mean   :74.93  
##                     3rd Qu.:251.8                      3rd Qu.:84.90  
##                     Max.   :305.0                      Max.   :96.10  
##     LowTemp      Precipitation      Brooklyn.Bridge Manhattan.Bridge
##  Min.   :26.10   Length:214         Min.   : 504    Min.   : 997    
##  1st Qu.:53.23   Class :character   1st Qu.:2388    1st Qu.:3713    
##  Median :64.90   Mode  :character   Median :3076    Median :5132    
##  Mean   :61.97                      Mean   :3031    Mean   :5052    
##  3rd Qu.:71.10                      3rd Qu.:3685    3rd Qu.:6610    
##  Max.   :82.00                      Max.   :8264    Max.   :9152    
##  Williamsburg.Bridge Queensboro.Bridge     Total      
##  Min.   :1440        Min.   :1306      Min.   : 4335  
##  1st Qu.:4884        1st Qu.:3496      1st Qu.:14826  
##  Median :6334        Median :4342      Median :19002  
##  Mean   :6161        Mean   :4301      Mean   :18545  
##  3rd Qu.:7858        3rd Qu.:5308      3rd Qu.:23254  
##  Max.   :9148        Max.   :6392      Max.   :28437

We begin by creating a categorical variable for the day of the week. For this exercise, we will focus specifically on the Williamsburg Bridge, so we will also remove the columns we don’t need. The data now has these variables:

  • weekday: Day of the week
  • hightemp: The daily high temperature
  • lowtemp: The daily low temperature
  • precip: The amount (and type) of precipitation
  • count: The target variable for this regression
bikes <- bikes_raw[c('Day','HighTemp','LowTemp','Precipitation','Williamsburg.Bridge')]
names(bikes) <- c('weekday','hightemp','lowtemp','precip','count')
head(bikes)

Here we can see a potential problem with the data: the precip column has some non-numeric entries used to distinguish (T)race and (S)now types of precipitation. To keep things simple and reduce the number of predictor variables, we are going to make the following adjustments… (We will assume that bicyclists are tough and NYC bicyclists are tougher still. Otherwise, why would they be bicycling in NYC?!)
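Before transforming anything, it helps to see exactly which values are non-numeric. The snippet below is a minimal sketch using a made-up vector that mimics the raw column; applying the same filter to bikes_raw$Precipitation would list the actual trace and snow entries.

```r
# Made-up sample values mimicking the raw Precipitation column
precip_example <- c("0.00", "T", "0.47 (S)", "1.65")

# Keep only the entries that contain letters, i.e. the non-numeric flags
unique(precip_example[grepl("[A-Za-z]", precip_example)])
```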

We will split the precip column into two columns, precip_rain and precip_snow, and put their respective values in each. Trace amounts will be converted to zero.

na.zero <- function (x) {
    x[is.na(x)] <- 0
    return(x)
}

get.snow <- function(x) {
  # Return the snow amount when the value is flagged "(S)", otherwise 0.
  # fixed=TRUE matches the literal "(S)"; without it, the parentheses are
  # treated as a regex group and any "S" would match.
  result <- 0
  if (grepl('(S)', x, fixed = TRUE)) {
    result <- as.double(gsub("[^0-9.-]", "", x))
  }
  return(result)
}

#  Easy part with rain.  This will create a warning, but we will deal with that...
nyc_bikes <- within(bikes, precip_rain<-as.double(precip))
## Warning in eval(substitute(expr), e): NAs introduced by coercion
#  Here we deal with the NAs
cat("\n\n")
nyc_bikes$precip_rain <- na.zero(nyc_bikes$precip_rain)

# sapply returns a numeric vector directly, so no separate conversion is needed
nyc_bikes$precip_snow <- sapply(nyc_bikes$precip, get.snow)
nyc_bikes <- data.frame(nyc_bikes[c('weekday','hightemp','lowtemp','precip_rain','precip_snow','count')])

# Finally we will convert weekday to a categorical variable.
nyc_bikes$weekday <- as.factor(nyc_bikes$weekday)
head(nyc_bikes)
cat("\n")
summary(nyc_bikes)
##       weekday      hightemp        lowtemp       precip_rain    
##  Friday   :31   Min.   :39.90   Min.   :26.10   Min.   :0.0000  
##  Monday   :31   1st Qu.:66.05   1st Qu.:53.23   1st Qu.:0.0000  
##  Saturday :31   Median :78.10   Median :64.90   Median :0.0000  
##  Sunday   :31   Mean   :74.93   Mean   :61.97   Mean   :0.1069  
##  Thursday :30   3rd Qu.:84.90   3rd Qu.:71.10   3rd Qu.:0.0400  
##  Tuesday  :30   Max.   :96.10   Max.   :82.00   Max.   :1.6500  
##  Wednesday:30                                                   
##   precip_snow           count     
##  Min.   :0.000000   Min.   :1440  
##  1st Qu.:0.000000   1st Qu.:4884  
##  Median :0.000000   Median :6334  
##  Mean   :0.002196   Mean   :6161  
##  3rd Qu.:0.000000   3rd Qu.:7858  
##  Max.   :0.470000   Max.   :9148  
## 
write.csv(nyc_bikes, "nyc_bike_counts.csv", row.names = F)

Barring any further cleaning, we will turn our attention to modeling.

Let’s continue our exploratory analysis by looking at the distribution of the response variable across the levels of the categorical variable, weekday. We load the MASS library here for use in later modeling; the boxplot itself needs only base R.

library(MASS)
nyc_bikes$weekday <- factor(nyc_bikes$weekday, levels=c('Sunday', 'Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))
boxplot(count~weekday, xlab="Weekday", ylab="Count", data=nyc_bikes)

Taking a look at the quantitative variables, hightemp, lowtemp, precip_rain, precip_snow and count, we have the following:

nyc_bikes_quant=data.frame(nyc_bikes[c('hightemp','lowtemp','precip_rain','precip_snow','count')])
plot(nyc_bikes_quant)

Here we can see some pretty striking multicollinearity, particularly between the two temperature variables. It is also clear that the temperatures are cyclic; we could, if we wanted to, try some seasonality reduction for them. We also note that snow and rain are mutually exclusive, and that there are a couple of outliers for snow. As part of this analysis, we will see which variables are adding value and may choose to remove some of them later. We will come back to these concerns in due course.
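To put a number on the temperature collinearity, we can look at the correlation directly. The sketch below uses simulated highs and lows (made-up values chosen to resemble the ranges in the summary above); with the real data, one would compute cor(nyc_bikes_quant) instead.

```r
set.seed(1)
# Simulated daily highs, roughly matching the range in the summary above
hightemp <- runif(200, 40, 96)
# Lows that track the highs, minus a noisy offset
lowtemp <- hightemp - rnorm(200, mean = 13, sd = 4)
round(cor(hightemp, lowtemp), 2)
# A correlation near 1 means one temperature variable adds little
# information once the other is already in the model.
```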

Mathematical Preliminaries

In Poisson regression, as we mentioned before, we model a response variable that is a count (or rate) based on predictor variables. This is a special case of regression. It may be tempting to just use ordinary linear regression here, but Poisson regression has some specific advantages.

When we model data with regression, we make certain assumptions. Those assumptions allow us to make statistical statements about our model and predictions. For standard linear regression (SLR), we make the following assumptions:

Standard Linear Regression:

\[ Y_i=\beta_0+ \beta_1x_{i,1}+ \beta_2x_{i,2} +...+ \beta_px_{i,p}+\epsilon_i, i=1,...,n\] The assumptions that we make have to do with the error term \(\epsilon\) above. Specifically…

  • Linearity/Mean Zero: The expected value of the error terms is zero: \(E(\epsilon_i)=0\).
  • Constant Variance: The variance of the error terms is constant: \(Var(\epsilon_i)=\sigma^2\).
  • Independence: \(\epsilon_1,...,\epsilon_n\) are independent random variables.
  • Normality: The errors are normally distributed: \(\epsilon_i\sim N(0,\sigma^2)\).
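For count data these assumptions break down: the response is not normal, and its variance grows with its mean. A short simulation with arbitrary rates illustrates the variance problem:

```r
set.seed(42)
for (lambda in c(5, 50, 500)) {
  y <- rpois(10000, lambda)
  cat(sprintf("lambda = %4d  mean = %7.1f  var = %7.1f\n",
              lambda, mean(y), var(y)))
}
# For a Poisson variable the variance equals the mean, so the spread of
# the response grows with the fitted value -- the constant-variance
# assumption of SLR cannot hold for raw counts.
```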

In our case, however, the response variable does NOT follow a normal distribution. Ok. That is fine, and it gives rise to the Generalized Linear Model…

\[g(E(Y|x_1,...,x_p))=\beta_0+\beta_1x_{i,1}+ \beta_2x_{i,2} +...+ \beta_px_{i,p}\] \[OR\]

\[E(Y|x_1,...,x_p)=g^{-1}(\beta_0+\beta_1x_{i,1}+ \beta_2x_{i,2} +...+ \beta_px_{i,p})\] where \(g()\) is a “link” function and \(g^{-1}()\) is its inverse. We pick the link function based on the distribution of Y; here, that choice rests on the fact that Y belongs to the exponential family of distributions. When the response is distributed as a Poisson random variable, we use the following approach:

\[ \ln(E(Y|x_1,...,x_p)) = \beta_0+\beta_1x_1 + \beta_2x_2 +...+ \beta_px_p \]

\[OR\] \[ E(Y|x_1,...,x_p) = e^{\beta_0+\beta_1x_1 + \beta_2x_2 +...+ \beta_px_p} \]

We might wonder how this differs from ordinary linear regression with a log-transformed response. The answer is that one of the SLR assumptions is constant error variance, and taking the log of the response does not make that hold for count data. It turns out we can more or less force the issue by using the variance-stabilizing transformation \(\sqrt{Y+3/8}\), provided the response values are large (say, greater than 5). Our response values are quite large, so SLR with this stabilizing transformation might be fine; however, we will continue with the Poisson approach for the rest of this writeup.
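The claim about the stabilizing transformation is easy to check by simulation: for Poisson data, the variance of \(\sqrt{Y+3/8}\) is approximately 1/4 regardless of the rate. The sketch below uses arbitrary rates:

```r
set.seed(7)
for (lambda in c(10, 100, 1000)) {
  y <- rpois(50000, lambda)
  cat(sprintf("lambda = %5d  var(sqrt(y + 3/8)) = %.3f\n",
              lambda, var(sqrt(y + 3/8))))
}
# Each printed variance is close to 0.25, independent of lambda, which
# is why an SLR on the transformed response can be defensible when the
# counts are large.
```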

Fitting a Model

Next, we will fit a standard linear model, just to see what we get. This will likely not be a great fit, but we fit it here for comparison with the Poisson model.

mdl0 <- lm(count~., data=nyc_bikes)

plt_mdl <- function(mdl) {
  par(mfrow=c(2,2))
  res <- mdl$residuals
  # lm objects have no $data element; plot residuals against observation order
  plot(seq_along(res), res, xlab="Observation (day index)", ylab="Residuals")
  plot(fitted(mdl), res, xlab="Fitted Values", ylab="Residuals")
  abline(h=0, lty=2, col="red")
  hist(res, xlab="Residuals", ylab="Frequency")
  qqnorm(res)
  qqline(res)
}
plt_mdl(mdl0)