Poisson Regression
Poisson Regression, Making it Count
Miles R. Porter
March 7, 2020
Introduction
Poisson regression is a modeling technique that can be used to predict counts. In this case, we define counts as the number of occurrences of an event in a given period of time. An equivalent way of looking at counts is to look at rates. In statistics, we frequently use the Poisson distribution when working with counts of events, and that concept can be extended to regression. This notebook will look at a regression model for the number of bicyclists that cross a bridge in New York based on predictor variables that include the day of the week, the temperature (high and low), and the presence of precipitation.
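As a small illustration of what we mean by count data, here is a sketch in R using a made-up rate of 5 events per day (the rate and sample size are arbitrary, chosen only for illustration):

lambda <- 5  # hypothetical average rate: 5 events per day
# Probability of seeing exactly k events in one day, for k = 0..10
dpois(0:10, lambda)
# Simulate a month of daily counts from the same process
set.seed(42)
rpois(30, lambda)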
The Data Set
The data for this model started with a Kaggle bicycle challenge data set. Unfortunately, the data listed in that Kaggle challenge is not valid: in the Kaggle version of the data, the month of April is repeated several times, and the other months are not included. The data has been corrected for this notebook using the original source, and a comment has been posted to the Kaggle site with the corrected source data set.
We will begin by doing some initial data exploration on this corrected data set.
bikes_raw <- read.csv('NYC_Bicycle_Counts_2016_Corrected.csv', header=T, stringsAsFactors=F)
summary(bikes_raw)
## Date DOY Day HighTemp
## Length:214 Min. : 92.0 Length:214 Min. :39.90
## Class :character 1st Qu.:145.2 Class :character 1st Qu.:66.05
## Mode :character Median :198.5 Mode :character Median :78.10
## Mean :198.5 Mean :74.93
## 3rd Qu.:251.8 3rd Qu.:84.90
## Max. :305.0 Max. :96.10
## LowTemp Precipitation Brooklyn.Bridge Manhattan.Bridge
## Min. :26.10 Length:214 Min. : 504 Min. : 997
## 1st Qu.:53.23 Class :character 1st Qu.:2388 1st Qu.:3713
## Median :64.90 Mode :character Median :3076 Median :5132
## Mean :61.97 Mean :3031 Mean :5052
## 3rd Qu.:71.10 3rd Qu.:3685 3rd Qu.:6610
## Max. :82.00 Max. :8264 Max. :9152
## Williamsburg.Bridge Queensboro.Bridge Total
## Min. :1440 Min. :1306 Min. : 4335
## 1st Qu.:4884 1st Qu.:3496 1st Qu.:14826
## Median :6334 Median :4342 Median :19002
## Mean :6161 Mean :4301 Mean :18545
## 3rd Qu.:7858 3rd Qu.:5308 3rd Qu.:23254
## Max. :9148 Max. :6392 Max. :28437
We begin by creating a categorical variable for the day of the week. For this exercise, we will focus specifically on the Williamsburg Bridge, so we will also remove the columns we don't need. The data now has these variables:
- weekday: Day of the week
- hightemp: The daily high temperature
- lowtemp: The daily low temperature
- precip: The amount (and type) of precipitation
- count: The target variable for this regression, the daily bicyclist count on the Williamsburg Bridge
# Keep only the columns we need
bikes <- bikes_raw[c('Day','HighTemp','LowTemp','Precipitation','Williamsburg.Bridge')]
names(bikes) <- c('weekday','hightemp','lowtemp','precip','count')
head(bikes)
Here we can see a potential problem with the data, namely that the precip column has some non-numeric data in it to distinguish (T)race and (S)now types of precipitation. In order to keep things simple, and to reduce the number of predictor variables, we are going to make the following adjustments… (We will assume that these bicyclists are tough and NYC bicyclists are tougher still. Otherwise, why would they be bicycling in NYC?!)
We will split the precip column into two columns, precip_rain and precip_snow, and put the respective values in each. For trace amounts, we will convert those values to zero.
# Replace any NA values in a vector with zero
na.zero <- function (x) {
  x[is.na(x)] <- 0
  return(x)
}
# Extract the numeric snowfall amount from a value flagged with (S);
# anything without the flag is treated as zero snow
get.snow <- function(x) {
  result <- 0
  if (grepl('(S)', x)) {
    result <- as.double(gsub("[^0-9.-]", "", x))
  }
  return(result)
}
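As a quick sanity check of these helpers, here is how they behave on illustrative values (the example strings assume the format used in the precip column, with trace amounts recorded as "T" and snow amounts flagged with "(S)"):

get.snow("0.47 (S)")   # snow-flagged value: returns 0.47
get.snow("0.15")       # plain rain amount, no (S) flag: returns 0
na.zero(c(0.3, NA))    # an NA (e.g. from coercing "T") becomes 0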
# Easy part with rain. This will create a warning, but we will deal with that...
nyc_bikes <- within(bikes, precip_rain<-as.double(precip))
## Warning in eval(substitute(expr), e): NAs introduced by coercion
# Here we deal with the NAs: trace amounts ("T") become zero
nyc_bikes$precip_rain <- na.zero(nyc_bikes$precip_rain)
# Pull the snow amounts out of the original precip strings as a numeric vector
nyc_bikes$precip_snow <- sapply(nyc_bikes$precip, get.snow, USE.NAMES = FALSE)
nyc_bikes <- data.frame(nyc_bikes[c('weekday','hightemp','lowtemp','precip_rain','precip_snow','count')])
# Finally we will convert weekday to a categorical variable.
nyc_bikes$weekday <- as.factor(nyc_bikes$weekday)
head(nyc_bikes)
cat("\n")
summary(nyc_bikes)
## weekday hightemp lowtemp precip_rain
## Friday :31 Min. :39.90 Min. :26.10 Min. :0.0000
## Monday :31 1st Qu.:66.05 1st Qu.:53.23 1st Qu.:0.0000
## Saturday :31 Median :78.10 Median :64.90 Median :0.0000
## Sunday :31 Mean :74.93 Mean :61.97 Mean :0.1069
## Thursday :30 3rd Qu.:84.90 3rd Qu.:71.10 3rd Qu.:0.0400
## Tuesday :30 Max. :96.10 Max. :82.00 Max. :1.6500
## Wednesday:30
## precip_snow count
## Min. :0.000000 Min. :1440
## 1st Qu.:0.000000 1st Qu.:4884
## Median :0.000000 Median :6334
## Mean :0.002196 Mean :6161
## 3rd Qu.:0.000000 3rd Qu.:7858
## Max. :0.470000 Max. :9148
##
write.csv(nyc_bikes, "nyc_bike_counts.csv", row.names = F)
Barring any further cleaning, we will turn our attention to modeling.
Let’s continue our exploratory analysis by looking at the distribution of the response variable based on the categorical variable, weekday. We will do that with the help of the MASS library in R.
library(MASS)
nyc_bikes$weekday <- factor(nyc_bikes$weekday, levels=c('Sunday', 'Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))
boxplot(count~weekday, xlab="Weekday", ylab="Count", data=nyc_bikes)
Taking a look at the quantitative variables, hightemp, lowtemp, precip_rain, precip_snow and count, we have the following:
nyc_bikes_quant=data.frame(nyc_bikes[c('hightemp','lowtemp','precip_rain','precip_snow','count')])
plot(nyc_bikes_quant)
Here we can see some pretty striking multicollinearity, particularly between the two temperature variables. It is also clear that the temperatures are cyclic (seasonal); we could, if we wanted to, try some seasonality reduction for the temperature variables. We also note that snow and rain are mutually exclusive. As part of this analysis, we will see which variables add value, and we may choose to remove some of them later. Finally, we notice a couple of outliers for snow. We will come back to these concerns later.
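To put numbers on those collinearity concerns, a quick sketch is to look at the pairwise correlation matrix of the quantitative columns we just plotted:

# Pairwise correlations; hightemp and lowtemp should show a strong correlation
round(cor(nyc_bikes_quant), 2)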
Mathematical Preliminaries
In Poisson regression, as we mentioned before, we attempt to model a response variable that is a count or rate based on predictor variables. This is a special case of regression. It may be tempting to just use regular linear regression in this case; however, Poisson regression does have some specific advantages.
When we model data with regression, we make certain assumptions. Those assumptions allow us to make statistical statements about our model and predictions. For standard linear regression (SLR), we make the following assumptions:
Standard Linear Regression:
\[ Y_i=\beta_0+ \beta_1x_{i,1}+ \beta_2x_{i,2} +...+ \beta_px_{i,p}+\epsilon_i, i=1,...,n\] The assumptions that we make have to do with the error term \(\epsilon\) above. Specifically…
- Linearity/Mean Zero: The expected value of the error terms in our model is zero: \(E(\epsilon_i)=0\).
- Constant Variance: The variance of the error terms is constant: \(Var(\epsilon_i)=\sigma^2\).
- Independence: {\(\epsilon_1,...,\epsilon_n\)} are independent random variables.
- Normality: The errors are normally distributed: \(\epsilon_i\sim N(0,\sigma^2)\).
In our case, however, the response variable does NOT follow a normal distribution. Ok. That is fine, and it gives rise to the Generalized Linear Model…
\[g(E(Y|x_1,...,x_p))=\beta_0+\beta_1x_1+ \beta_2x_2 +...+ \beta_px_p\] \[OR\]
\[E(Y|x_1,...,x_p)=g^{-1}(\beta_0+\beta_1x_1+ \beta_2x_2 +...+ \beta_px_p)\] Where \(g()\) is a “link” function and \(g^{-1}()\) is its inverse. We pick the link function based on the distribution of Y; specifically, we rely on the fact that Y above belongs to the “exponential” family of distributions. In the case where our response data is distributed as a Poisson random variable, we use the following approach:
\[ \ln(E(Y|x_1,...,x_p)) = \beta_0+\beta_1x_1 + \beta_2x_2 +...+ \beta_px_p \]
\[OR\] \[ E(Y|x_1,...,x_p) = e^{\beta_0+\beta_1x_1 + \beta_2x_2 +...+ \beta_px_p} \]
We might wonder why this is any different from using SLR with a log transformation of the response variable. The answer is that one of the assumptions we make for SLR is that the error terms have constant variance; if we simply take the log of the response variable, that will not be the case. However, it turns out that we can more or less force the issue if we use a variance-stabilizing transformation of \(\sqrt{\mu+3/8}\) and if we have large response variable values (say, greater than 5). In our example, we actually have quite large response variable values, so using SLR with the variance-stabilizing transformation might be fine. However, we will continue with the Poisson approach for the rest of this writeup.
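To make the distinction concrete, here is a minimal sketch on simulated data (the rate function, sample size, and object names are made up for illustration) comparing the Poisson GLM with the variance-stabilized SLR just described:

set.seed(1)
x <- runif(200)
# Simulate counts whose log-mean is linear in x: ln(E(Y|x)) = 1 + 2x
y <- rpois(200, lambda = exp(1 + 2*x))
# Poisson GLM with the canonical log link models ln(E(Y|x)) directly
pois_fit <- glm(y ~ x, family = poisson(link = "log"))
coef(pois_fit)  # estimates should be near (1, 2)
# SLR on the variance-stabilizing transform sqrt(y + 3/8)
slr_fit <- lm(sqrt(y + 3/8) ~ x)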
Fitting a Model
Next, we will fit a standard linear model, just to see what we get. This will likely not be a great fit, but we do it here for comparison against the Poisson model.
mdl0 <- lm(count~., data=nyc_bikes)
# Standard residual diagnostics for a fitted model
plt_mdl <- function(mdl) {
  par(mfrow=c(2,2))
  res <- mdl$residuals
  # Residuals in observation (day) order
  plot(res, xlab="Observation order", ylab="Residuals")
  # Residuals versus fitted values
  plot(fitted(mdl), res, xlab="Fitted Values", ylab="Residuals")
  abline(h=0, lty=2, col="red")
  hist(res, xlab="Residuals", ylab="Frequency")
  qqnorm(res)
  qqline(res)
}
plt_mdl(mdl0)
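For comparison, the analogous Poisson fit is a one-line change using glm() with the log link. Here is a sketch of the corresponding call (the name mdl1 is just a placeholder):

# The analogous Poisson model: same formula, log link
mdl1 <- glm(count ~ ., data = nyc_bikes, family = poisson(link = "log"))
summary(mdl1)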