Poisson Regression, Making it Count

Introduction

Poisson regression is a modeling technique that can be used to predict counts. Here we define a count as the number of occurrences of an event in a given period of time. An equivalent way of looking at counts is to look at rates. In statistics, we frequently use the Poisson distribution when working with counts of events, and that concept extends naturally to regression. This notebook builds a regression model for the number of bicyclists that cross a bridge in New York, using predictor variables that include the day of the week, the daily high and low temperatures, and the presence of precipitation.
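As a quick, illustrative aside (the rate of 3 events per day is made up for the example, and this snippet is not part of the analysis), the Poisson distribution assigns probabilities to counts like so:

# Illustrative only: probabilities of observing k = 0..5 events when
# events arrive at an average rate of 3 per day.
lambda <- 3
dpois(0:5, lambda = lambda)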

The data set

The data for this model started with a Kaggle bicycle challenge data set. Unfortunately, the data listed in that Kaggle challenge is not valid: in the Kaggle version of the data, the month of April is repeated several times, and the other months are not included. The data has been corrected for this notebook using the original source at , and a comment has been posted to the Kaggle site with the corrected source data set.

We will begin by doing some initial data exploration on this corrected data set.

bikes_raw <- read.csv('NYC_Bicycle_Counts_2016_Corrected.csv', header=T, stringsAsFactors=F)
summary(bikes_raw)
##      Date                DOY            Day               HighTemp    
##  Length:214         Min.   : 92.0   Length:214         Min.   :39.90  
##  Class :character   1st Qu.:145.2   Class :character   1st Qu.:66.05  
##  Mode  :character   Median :198.5   Mode  :character   Median :78.10  
##                     Mean   :198.5                      Mean   :74.93  
##                     3rd Qu.:251.8                      3rd Qu.:84.90  
##                     Max.   :305.0                      Max.   :96.10  
##     LowTemp      Precipitation      Brooklyn.Bridge Manhattan.Bridge
##  Min.   :26.10   Length:214         Min.   : 504    Min.   : 997    
##  1st Qu.:53.23   Class :character   1st Qu.:2388    1st Qu.:3713    
##  Median :64.90   Mode  :character   Median :3076    Median :5132    
##  Mean   :61.97                      Mean   :3031    Mean   :5052    
##  3rd Qu.:71.10                      3rd Qu.:3685    3rd Qu.:6610    
##  Max.   :82.00                      Max.   :8264    Max.   :9152    
##  Williamsburg.Bridge Queensboro.Bridge     Total      
##  Min.   :1440        Min.   :1306      Min.   : 4335  
##  1st Qu.:4884        1st Qu.:3496      1st Qu.:14826  
##  Median :6334        Median :4342      Median :19002  
##  Mean   :6161        Mean   :4301      Mean   :18545  
##  3rd Qu.:7858        3rd Qu.:5308      3rd Qu.:23254  
##  Max.   :9148        Max.   :6392      Max.   :28437

We begin by creating a categorical variable for the day of the week. For this exercise, we will focus specifically on the Williamsburg Bridge, so we will also drop the columns we don’t need. The data now has these variables:

  • weekday: Day of the week
  • hightemp: The daily high temperature
  • lowtemp: The daily low temperature
  • precip: The amount (and type) of precipitation
  • count: The target variable for this regression (the Williamsburg Bridge count)
bikes <- bikes_raw[c('Day','HighTemp','LowTemp','Precipitation','Williamsburg.Bridge')]
names(bikes) <- c('weekday','hightemp','lowtemp','precip','count')
head(bikes)

Here we can see a potential problem with the data, namely that the precip column has some nonnumeric data in it to distinguish (T)race and (S)now types of precipitation. In order to keep things simple, and reduce the number of predictor variables, we are going to make the following adjustments… (We will assume that these bicyclists are tough and NYC bicyclists are tougher still. Otherwise, why would they be bicycling in NYC?!)
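Before transforming anything, a quick check like the following (illustrative; output not shown) reveals exactly which precip entries are non-numeric:

# Peek at the precip entries that contain letters (the T/S flags)
unique(bikes$precip[grepl('[A-Za-z]', bikes$precip)])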

We will split the precip column into two columns, precip_rain and precip_snow, and put the respective values in each. Trace amounts will be converted to zero.

# Replace NAs with zero (used after coercing the rain values to numeric).
na.zero <- function (x) {
    x[is.na(x)] <- 0
    return(x)
}

# Pull the snowfall amount out of entries flagged with a literal "(S)";
# anything else gets zero.
get.snow <- function(x) {
  result <- 0
  if (grepl('(S)', x, fixed=TRUE)){
    result <- as.double(gsub("[^0-9.-]", "", x))
  }
  return(result)
}

#  Easy part with rain.  This will create a warning, but we will deal with that...
nyc_bikes <- within(bikes, precip_rain<-as.double(precip))
## Warning in eval(substitute(expr), e): NAs introduced by coercion
#  Here we deal with the NAs
cat("\n\n")
nyc_bikes$precip_rain <- na.zero(nyc_bikes$precip_rain)

# get.snow is not vectorized, so apply it element-wise; lapply returns a
# list, which as.numeric flattens back to a numeric vector.
nyc_bikes$precip_snow <- lapply(nyc_bikes$precip, get.snow)
nyc_bikes$precip_snow <- as.numeric(nyc_bikes$precip_snow)
nyc_bikes <- data.frame(nyc_bikes[c('weekday','hightemp','lowtemp','precip_rain','precip_snow','count')])

# Finally we will convert weekday to a categorical variable.
nyc_bikes$weekday <- as.factor(nyc_bikes$weekday)
head(nyc_bikes)
cat("\n")
summary(nyc_bikes)
##       weekday      hightemp        lowtemp       precip_rain    
##  Friday   :31   Min.   :39.90   Min.   :26.10   Min.   :0.0000  
##  Monday   :31   1st Qu.:66.05   1st Qu.:53.23   1st Qu.:0.0000  
##  Saturday :31   Median :78.10   Median :64.90   Median :0.0000  
##  Sunday   :31   Mean   :74.93   Mean   :61.97   Mean   :0.1069  
##  Thursday :30   3rd Qu.:84.90   3rd Qu.:71.10   3rd Qu.:0.0400  
##  Tuesday  :30   Max.   :96.10   Max.   :82.00   Max.   :1.6500  
##  Wednesday:30                                                   
##   precip_snow           count     
##  Min.   :0.000000   Min.   :1440  
##  1st Qu.:0.000000   1st Qu.:4884  
##  Median :0.000000   Median :6334  
##  Mean   :0.002196   Mean   :6161  
##  3rd Qu.:0.000000   3rd Qu.:7858  
##  Max.   :0.470000   Max.   :9148  
## 
write.csv(nyc_bikes, "nyc_bike_counts.csv", row.names = F)

Barring any further cleaning, we will turn our attention to modeling.

Let’s continue our exploratory analysis by looking at the distribution of the response variable across the levels of the categorical variable, weekday. We will reorder the factor levels so the days appear in calendar order.

library(MASS)
nyc_bikes$weekday <- factor(nyc_bikes$weekday, levels=c('Sunday', 'Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))
boxplot(count~weekday, xlab="Weekday", ylab="Count", data=nyc_bikes)

Taking a look at the quantitative variables, hightemp, lowtemp, precip_rain, precip_snow and count, we have the following:

nyc_bikes_quant <- nyc_bikes[c('hightemp','lowtemp','precip_rain','precip_snow','count')]
plot(nyc_bikes_quant)

Here we can see some pretty striking multicollinearity, particularly between the two temperature variables. It is also clear that the temperatures follow a seasonal cycle; we could, if we wanted to, try some seasonality reduction for the temperature variables. We also note that snow and rain are mutually exclusive, and that there are a couple of outliers for snow. As part of this analysis, we will see which variables add value, and may choose to remove some later. We will come back to these concerns below.
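To quantify what the scatterplot matrix suggests, we can also compute the correlation matrix (output not shown; the high/low temperature correlation is the one to watch):

# Pairwise correlations among the quantitative variables
round(cor(nyc_bikes_quant), 2)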

Mathematical Preliminaries

In Poisson regression, as we mentioned before, we model a response variable that is a count (or rate) as a function of predictor variables. This is a special case of regression. It may be tempting to just use ordinary linear regression here, but Poisson regression has some specific advantages.

When we model data with regression, we make certain assumptions. Those assumptions allow us to make statistical statements about our model and predictions. For standard linear regression (SLR), we make the following assumptions:

Standard Linear Regression:

\[ Y_i=\beta_0+ \beta_1x_{i,1}+ \beta_2x_{i,2} +...+ \beta_px_{i,p}+\epsilon_i, i=1,...,n\] The assumptions that we make have to do with the error term \(\epsilon\) above. Specifically…

  • Linearity/Mean Zero: The expected value of the error terms in our model is zero: \(E(\epsilon_i)=0\).
  • Constant Variance: The variance of the error terms is constant, e.g. \(Var(\epsilon_i)=\sigma^2\).
  • Independence: {\(\epsilon_1,...,\epsilon_n\)} are independent random variables.
  • Normality: The errors are normally distributed: \(\epsilon_i\sim N(0,\sigma^2)\).

In our case, however, the response variable does NOT follow a normal distribution. Ok. That is fine, and it gives rise to the Generalized Linear Model…

\[g(E(Y|x_1,...,x_p))=\beta_0+\beta_1x_{i,1}+ \beta_2x_{i,2} +...+ \beta_px_{i,p}\] \[OR\]

\[E(Y|x_1,...,x_p)=g^{-1}(\beta_0+\beta_1x_{i,1}+ \beta_2x_{i,2} +...+ \beta_px_{i,p})\] Where \(g()\) is a “link” function and \(g^{-1}()\) is its inverse. We pick the link function based on the distribution of Y; in particular, on the fact that Y above belongs to the “exponential family” of distributions. In the case where our response data is distributed as a Poisson random variable, we use the following approach:

\[ \ln(E(Y|x_1,...,x_p)) = \beta_0+\beta_1x_1 + \beta_2x_2 +...+ \beta_px_p \]

\[OR\] \[ E(Y|x_1,...,x_p) = e^{\beta_0+\beta_1x_1 + \beta_2x_2 +...+ \beta_px_p} \]
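In R, this log link is exactly what the poisson family object encodes; a small illustration (not part of the analysis):

# The poisson() family object carries the log link and its inverse
fam <- poisson()
fam$link                  # "log"
fam$linkfun(10)           # g(10) = log(10)
fam$linkinv(log(10))      # g^{-1} maps back to 10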

We might wonder why this is any different from using SLR with a log transformation of the response variable. The answer is that one of the assumptions we make for SLR is that the error terms have constant variance, and if we simply take the log of a Poisson response, that will not be the case. However, it turns out that we can more or less force the issue by using the variance-stabilizing transformation \(\sqrt{\mu+3/8}\), provided we have large response variable values (as in greater than 5). In our example, we actually have quite large response values, so using SLR with the variance-stabilizing transformation might be fine. However, we will continue with the Poisson approach for the rest of this writeup.
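For the curious, a minimal sketch of that variance-stabilizing alternative would look like the following; we don't pursue it further here:

# SLR on the variance-stabilized response sqrt(count + 3/8); 'count' is
# excluded from the '.' expansion because it already appears on the left.
mdl_vst <- lm(sqrt(count + 3/8) ~ ., data = nyc_bikes)
# summary(mdl_vst)   # compare diagnostics against the Poisson fit below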

Fitting a Model

Next, we will fit a standard linear model, just to see what we get. This will likely not be a great fit, but I am doing it here for comparison with the Poisson model.

mdl0 <- lm(count~., data=nyc_bikes)

plt_mdl <- function(mdl) {
  par(mfrow=c(2,2))
  res <- mdl$residuals
  plot(res, xlab="Observation (date order)", ylab="Residuals")  # lm objects have no $data element
  plot(fitted(mdl), res, xlab="Fitted Values", ylab="Residuals")
  abline(h=0, lty=2, col="red")
  hist(res, xlab="Residuals", ylab="Frequency")
  qqnorm(res)
  qqline(res)
}
plt_mdl(mdl0)

summary(mdl0)
## 
## Call:
## lm(formula = count ~ ., data = nyc_bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2905.1  -521.7   137.0   619.2  1918.9 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -867.51     444.39  -1.952  0.05230 .  
## weekdayMonday     2045.56     252.87   8.089 5.38e-14 ***
## weekdayTuesday    2632.15     252.61  10.420  < 2e-16 ***
## weekdayWednesday  2861.68     252.69  11.325  < 2e-16 ***
## weekdayThursday   2415.68     251.93   9.589  < 2e-16 ***
## weekdayFriday     1756.22     250.51   7.011 3.44e-11 ***
## weekdaySaturday    401.70     249.46   1.610  0.10889    
## hightemp           125.14      14.12   8.863 3.98e-16 ***
## lowtemp            -61.30      15.20  -4.034 7.75e-05 ***
## precip_rain      -2358.20     270.75  -8.710 1.07e-15 ***
## precip_snow      -6866.81    2157.81  -3.182  0.00169 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 981.5 on 203 degrees of freedom
## Multiple R-squared:  0.7485, Adjusted R-squared:  0.7361 
## F-statistic: 60.41 on 10 and 203 DF,  p-value: < 2.2e-16

Let’s compare the above with a Poisson regression on the data…

head(nyc_bikes)
mdl1 <- glm(count~., family="poisson", data=nyc_bikes)
summary(mdl1)
## 
## Call:
## glm(formula = count ~ ., family = "poisson", data = nyc_bikes)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -39.545   -7.662    0.784    8.717   27.325  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       7.4386802  0.0064331 1156.32   <2e-16 ***
## weekdayMonday     0.3927792  0.0035492  110.67   <2e-16 ***
## weekdayTuesday    0.4802382  0.0035055  137.00   <2e-16 ***
## weekdayWednesday  0.4990051  0.0034490  144.68   <2e-16 ***
## weekdayThursday   0.4374520  0.0034980  125.06   <2e-16 ***
## weekdayFriday     0.3420589  0.0035907   95.26   <2e-16 ***
## weekdaySaturday   0.0952104  0.0037565   25.35   <2e-16 ***
## hightemp          0.0224908  0.0001900  118.40   <2e-16 ***
## lowtemp          -0.0113279  0.0002006  -56.47   <2e-16 ***
## precip_rain      -0.4931333  0.0043281 -113.94   <2e-16 ***
## precip_snow      -2.5020266  0.0565319  -44.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 138456  on 213  degrees of freedom
## Residual deviance:  35709  on 203  degrees of freedom
## AIC: 37979
## 
## Number of Fisher Scoring iterations: 4

One thing we note above is that the normality improved considerably! In fact, this model looks TOO good based on the statistical significance of the coefficients. Let’s see if we can figure out why…
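One quick diagnostic, using numbers already in the summary above, is the ratio of residual deviance to residual degrees of freedom; for a well-specified Poisson model it should be near 1:

# Rough dispersion check for mdl1: 35709 / 203 is about 176, so the
# raw-count model is severely overdispersed and its standard errors
# (and hence the z-values) cannot be trusted.
deviance(mdl1) / df.residual(mdl1)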

Model Interpretation

Let’s look at the coefficients of our model…

coef(mdl1)
##      (Intercept)    weekdayMonday   weekdayTuesday weekdayWednesday 
##       7.43868024       0.39277921       0.48023823       0.49900506 
##  weekdayThursday    weekdayFriday  weekdaySaturday         hightemp 
##       0.43745198       0.34205894       0.09521040       0.02249084 
##          lowtemp      precip_rain      precip_snow 
##      -0.01132793      -0.49313335      -2.50202655

Unlike SLR, where we can simply interpret a coefficient as the respective change in the response variable for a one-unit change in the predictor, holding all other predictors constant, interpreting the coefficients in Poisson regression requires a bit more work.

For the hightemp variable, each one-unit increase in hightemp multiplies the expected daily count of bike rides by: \[exp(0.0224908)=1.0227457\]

So… what gives with those huge z-values?! Those standard errors are in no way trustworthy. To account for the crazy results, let’s rescale our response variable. In the course notes, the instructor recommended Poisson regression for data sets with “small” numbers of counts, so here we will scale the response variable to fall roughly between 0 and 10.

nyc_bikes_scaled <- data.frame(nyc_bikes)
nyc_bikes_scaled$count<- as.integer(nyc_bikes_scaled$count/1000)
head(nyc_bikes_scaled)
par(mfrow=c(1,2))
hist(nyc_bikes$count)
hist(nyc_bikes_scaled$count)

As we can see above, this slightly messed with the distribution of our data, but we will go with it.

We can also see how well a Poisson distribution fits our response variable, using the goodfit function from the vcd library.

library(vcd)
## Loading required package: grid
gf <- goodfit(nyc_bikes_scaled$count, "poisson")
plot(gf, type="standing", scale="raw")

Let’s continue by fitting another model, this time to our scaled data.

mdl2 <- glm(count~., family="poisson", data=nyc_bikes_scaled)
summary(mdl2)
## 
## Call:
## glm(formula = count ~ ., family = "poisson", data = nyc_bikes_scaled)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.49970  -0.27966   0.03794   0.29435   1.05056  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.332918   0.214496   1.552 0.120639    
## weekdayMonday     0.439652   0.119159   3.690 0.000225 ***
## weekdayTuesday    0.549458   0.117287   4.685  2.8e-06 ***
## weekdayWednesday  0.559255   0.115551   4.840  1.3e-06 ***
## weekdayThursday   0.513558   0.116815   4.396  1.1e-05 ***
## weekdayFriday     0.403540   0.120223   3.357 0.000789 ***
## weekdaySaturday   0.128917   0.126329   1.020 0.307498    
## hightemp          0.024063   0.006285   3.829 0.000129 ***
## lowtemp          -0.012238   0.006629  -1.846 0.064850 .  
## precip_rain      -0.541233   0.145774  -3.713 0.000205 ***
## precip_snow      -3.039662   2.141074  -1.420 0.155698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 153.596  on 213  degrees of freedom
## Residual deviance:  42.407  on 203  degrees of freedom
## AIC: 819.91
## 
## Number of Fisher Scoring iterations: 4

This is looking much better from a model perspective. What we did here was scale the output count to fall roughly between 1 and 10.

Before we turn our attention to how, specifically, to account for changes in the rate, let’s consider the following simple model:

\[ ln(\lambda_0) = \beta_0 + \beta_1x_0 \]

Now, let’s define a value \(\lambda_1\) and set it equal to the result of adding 1 unit to \(x_0\):

\[ ln(\lambda_1) = \beta_0 + \beta_1(x_0+1) \]

Now, if we expand out the right side, that gives us:

\[\ln(\lambda_1) = \beta_0 + \beta_1x_0 + \beta_1\]

Now, we exponentiate both sides…

\[ e^{\ln(\lambda_1)} = e^{\beta_0 + \beta_1x_0 + \beta_1} = e^{\ln(\lambda_0)}e^{\beta_1} = \lambda_0e^{\beta_1}\] Which gives us:

\[\lambda_1=\lambda_0e^{\beta_1}\]

So… what all of that means is that for a one-unit change in the predictor, the response variable changes by a factor of \(e^{\beta_1}\).
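In R, this amounts to exponentiating the fitted coefficients to get the multiplicative effects (rate ratios); output omitted, but the hightemp entry matches the calculation below:

# Multiplicative effect of a one-unit change in each predictor
round(exp(coef(mdl2)), 4)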

If we now turn our attention to the coefficient for hightemp, we have the following:

\[exp(0.0240626)=1.0243544\]

So we’d expect riders to increase by a factor of 1.0243544 for a one-unit increase in temperature, holding all other predictors constant.

Similarly, if we take a look at precip_rain, we can see that:

\[exp(-0.541233)=0.5820302\] We’d expect riders to decrease by a factor of 0.5820302 for each one-unit increase in rain, holding all other predictors constant.

Modeling Assumptions

For Poisson regression, as with logistic regression, we don’t have ordinary residuals to analyze. Instead, we use approximations of the residuals: either Pearson residuals or deviance residuals.

First we look at the linearity assumption.

rsid <- residuals(mdl2)   # deviance residuals by default for a glm
d <- deviance(mdl2)
par(mfrow=c(2,2))
plot(nyc_bikes_scaled$hightemp, rsid)
boxplot(rsid~weekday,xlab="Weekday",ylab = "Std residuals",data = nyc_bikes_scaled)
qqnorm(rsid, ylab="Std residuals")
qqline(rsid,col="blue",lwd=2)
hist(rsid)

Based on the above, things don’t look too bad. The residuals appear roughly normal, and their spread across the different days looks comparable. There is some concern about independence, however, particularly with respect to the temperatures: the residuals are spread across the x axis, but show a definite pattern.

We can also look at the dispersion of our model. The dispersion parameter is estimated as:

\[\hat\phi = \frac{D}{n-p-1}\]

n <- length(mdl2$fitted.values)
p <- 10   # number of predictor terms: 6 weekday dummies + 4 numeric variables
D <- sum(residuals(mdl2, type="deviance")^2)
phi <- D/(n-p-1)
print(phi)
## [1] 0.2088994

Based on this value of roughly 0.2 (a well-specified Poisson model should have \(\hat\phi\) near 1, with values well above 1 signaling overdispersion), our model does not appear to be overdispersed.
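An equivalent check uses Pearson residuals in place of deviance residuals; both estimates should tell roughly the same story:

# Pearson-based estimate of the dispersion parameter
pearson <- residuals(mdl2, type = "pearson")
sum(pearson^2) / (n - p - 1)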

Statistical Inference

Assuming that our model assumptions hold, which they kinda do, we can look at some statistical measures for our model. We are primarily interested in the following:

  1. Confidence intervals about our predictors.
  2. Assessing the overall statistical significance of the model.

We begin by looking at confidence intervals. The code below uses the “sandwich” library in R to create “robust standard errors”. This approach is based on work by Cameron and Trivedi (2009). For more information on this approach, see this source from UCLA.

library(sandwich)
cov.mdl2 <- vcovHC(mdl2, type="HC0")
std.err <- sqrt(diag(cov.mdl2))
r.est <- cbind(Estimate = coef(mdl2),
               "Robust SE" = std.err,
               "Pr(>|z|)" = 2 * pnorm(abs(coef(mdl2)/std.err), lower.tail=FALSE),
               LL = coef(mdl2) - 1.96 * std.err,
               UL = coef(mdl2) + 1.96 * std.err)
r.est
##                     Estimate   Robust SE      Pr(>|z|)          LL           UL
## (Intercept)       0.33291764 0.102549882  1.168757e-03  0.13191987  0.533915408
## weekdayMonday     0.43965202 0.049354814  5.196767e-19  0.34291659  0.536387458
## weekdayTuesday    0.54945848 0.047148605  2.195557e-31  0.45704721  0.641869742
## weekdayWednesday  0.55925477 0.042112299  3.020645e-40  0.47671466  0.641794877
## weekdayThursday   0.51355810 0.044141133  2.753372e-31  0.42704148  0.600074723
## weekdayFriday     0.40353980 0.047748321  2.878489e-17  0.30995309  0.497126509
## weekdaySaturday   0.12891662 0.055420633  2.001063e-02  0.02029218  0.237541066
## hightemp          0.02406258 0.002510306  9.199658e-22  0.01914238  0.028982778
## lowtemp          -0.01223831 0.002508518  1.067867e-06 -0.01715500 -0.007321615
## precip_rain      -0.54123302 0.065603294  1.582764e-16 -0.66981548 -0.412650569
## precip_snow      -3.03966240 0.113697878 1.867213e-157 -3.26251024 -2.816814559

Here we can see the 95% confidence intervals for the various predictors.
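Since the model is on the log scale, a common follow-up step is to exponentiate these estimates and limits to get incident rate ratios with their 95% intervals; a sketch:

# Rate-ratio scale: exponentiate the estimates and CI limits
exp(r.est[, c("Estimate", "LL", "UL")])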

Next, we assess the overall statistical significance of the regression model. This is done with the following hypothesis test:

\[ H_0: \beta_1=\beta_2=...=\beta_p=0 \qquad vs. \qquad H_a: \text{at least one } \beta_j \neq 0 \] To perform this test, we compare the null deviance to the residual deviance, computing the test statistic like so.

summary(mdl2)
## 
## Call:
## glm(formula = count ~ ., family = "poisson", data = nyc_bikes_scaled)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.49970  -0.27966   0.03794   0.29435   1.05056  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.332918   0.214496   1.552 0.120639    
## weekdayMonday     0.439652   0.119159   3.690 0.000225 ***
## weekdayTuesday    0.549458   0.117287   4.685  2.8e-06 ***
## weekdayWednesday  0.559255   0.115551   4.840  1.3e-06 ***
## weekdayThursday   0.513558   0.116815   4.396  1.1e-05 ***
## weekdayFriday     0.403540   0.120223   3.357 0.000789 ***
## weekdaySaturday   0.128917   0.126329   1.020 0.307498    
## hightemp          0.024063   0.006285   3.829 0.000129 ***
## lowtemp          -0.012238   0.006629  -1.846 0.064850 .  
## precip_rain      -0.541233   0.145774  -3.713 0.000205 ***
## precip_snow      -3.039662   2.141074  -1.420 0.155698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 153.596  on 213  degrees of freedom
## Residual deviance:  42.407  on 203  degrees of freedom
## AIC: 819.91
## 
## Number of Fisher Scoring iterations: 4
1-pchisq((153.596-42.407),(213-203))
## [1] 0

Based on the above, we can see that the p-value is essentially zero, so we reject the null hypothesis that the regression coefficients are all zero. Hence our model is statistically significant overall.
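Equivalently, we could fit an intercept-only model and let anova() carry out the same deviance comparison; a sketch:

# Null-model comparison via analysis of deviance
mdl_null <- glm(count ~ 1, family = "poisson", data = nyc_bikes_scaled)
anova(mdl_null, mdl2, test = "Chisq")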

Goodness of Fit

In order to assess goodness of fit, we will look at the deviance residuals and apply a \(\chi^2\) test with \(n-p-1\) degrees of freedom. The null hypothesis here is that the model is a good fit; since our p-value is large, we fail to reject that null hypothesis, and we conclude there is no evidence of lack of fit.

df <- n-p-1
## Test for GOF: Using deviance residuals
deviances2 <- residuals(mdl2,type="deviance")
dev.tvalue <- sum(deviances2^2)
c(dev.tvalue, 1-pchisq(dev.tvalue,df))
## [1] 42.40658  1.00000
#OR
c(deviance(mdl2), 1-pchisq(deviance(mdl2),df))
## [1] 42.40658  1.00000

Prediction

All of this has been leading to the point where we can make a prediction with our model. Let’s say that it is a Saturday, with a high temperature of 55 and a low temperature of 28, and no precipitation. What would our model predict for the (scaled) count of bikers on our bridge?

Drum roll please…

new_data <- data.frame(weekday='Saturday', hightemp=55, lowtemp=28, precip_rain=0, precip_snow=0)
new_data$weekday <- as.factor(new_data$weekday)

# predict.glm ignores lm-style 'interval'/'level' arguments, so we simply
# request the predicted mean on the response scale.
pred <- predict(mdl2, newdata = new_data, type = "response")
pred
##        1 
## 4.231698

Tada! That sounds somewhat reasonable: about 4.23 on the scaled response, or roughly 4,200 riders once we undo the division by 1,000.
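Since predict.glm does not produce true prediction intervals, one common approximation is a confidence interval for the mean on the link scale, back-transformed to the response scale; note that this is an interval for the expected count, not for a single new observation:

# Approximate 95% interval for the expected (scaled) count
pl <- predict(mdl2, newdata = new_data, type = "link", se.fit = TRUE)
exp(pl$fit + c(-1.96, 1.96) * pl$se.fit)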

Conclusion

There were a number of challenges in this data set, including the fact that the counts were very high, which might make this not such a great candidate for Poisson regression. I addressed this a bit by forcing the response variable down into a smaller range, and this allowed me to move forward with various metrics and tests on the data. The main area of concern was the pattern in the residuals versus high temperature. This led to some concerns over independence, but we may never know for sure, given that the data was sampled outside of this writeup.

In summary, I think this writeup has done a reasonable job of detailing the theory behind basic Poisson regression, but it was less than a complete success when it came to actually generating a successful model. Oh well. I guess that is why they call it data science. Some experiments are more successful than others.