Count Data

 


Definition

Count data are a statistical data type in which the observed values are non-negative integers that arise from counting. A count variable is a random variable whose distribution is commonly described by the Poisson, binomial, or negative binomial distribution. Poisson regression can be used to model count variables.

Poisson Regression

The Poisson distribution describes the number of times a random event occurs in a fixed interval of time or space. It arises as the limit of the binomial distribution when the probability of the event on each trial is very small but the number of trials is very large.
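
The relationship between a small event probability and a large number of trials can be checked numerically. The following R sketch compares binomial probabilities with the Poisson probabilities that share the same mean. The values of n_trials and p are illustrative assumptions and are not taken from any data set.

```r
# Illustrative values (assumptions): many trials, small event probability
n_trials <- 10000
p <- 0.0003
mu <- n_trials * p            # Poisson mean chosen to match the binomial mean

k <- 0:10                     # counts at which the two distributions are compared
binom_prob <- dbinom(k, size = n_trials, prob = p)
pois_prob  <- dpois(k, lambda = mu)

# Largest absolute difference between the two sets of probabilities
max_diff <- max(abs(binom_prob - pois_prob))

print(data.frame(k, binom_prob, pois_prob))
print(max_diff)
```

Increasing n_trials while holding mu fixed should make the two columns agree more closely.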

 

Model

Poisson regression models are generalized linear models in which the response variable is assumed to follow a Poisson distribution. The log link, which is the canonical link for the Poisson distribution, is commonly used in these models.

 

The Poisson probability distribution with mean $\mu > 0$:

$$P(Y = y) = \frac{e^{-\mu}\,\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$

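The Poisson response variable may be modeled as:

$$Y_i \sim \text{Poisson}(\mu_i), \qquad g(\mu_i) = x_i'\beta$$

where $g$ is the link function, $x_i$ is the vector of explanatory variables for observation $i$, and $\beta$ is the vector of regression coefficients.

Sometimes, the count responses will pertain to unequal units of time or space. In such cases, we let $n_i$ denote the amount of time or space (the exposure) over which the count $Y_i$ was observed, and we model the rate $\mu_i / n_i$. Then we have:

$$g\left(\frac{\mu_i}{n_i}\right) = x_i'\beta$$

Using the log link, we obtain:

$$\log\left(\frac{\mu_i}{n_i}\right) = x_i'\beta$$

$$\log(\mu_i) = \log(n_i) + x_i'\beta$$

$$\mu_i = n_i \exp(x_i'\beta)$$

The term $\log(n_i)$ enters the linear predictor with a fixed coefficient of 1 and is called the offset.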
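
To make the role of the offset concrete, the following R sketch fits a log-link Poisson model of the form log(mu_i) = log(n_i) + x_i'beta to the insurance data used in the examples in this article, then rebuilds the fitted means by hand as n_i * exp(x_i'beta). The notation (n_i for the exposure) and the object names fit, xb, and mu_by_hand are assumptions made for illustration.

```r
# Insurance data from the examples in this article
df <- data.frame(
  n   = c(500, 1200, 100, 400, 500, 300),   # exposure for each group
  c   = c(42, 37, 1, 101, 73, 14),          # observed counts
  car = factor(c("small", "medium", "large", "small", "medium", "large")),
  age = factor(c(1, 1, 1, 2, 2, 2))
)

# Poisson regression with log(n) as the offset
fit <- glm(c ~ car + age + offset(log(n)),
           family = poisson(link = "log"), data = df)

# Linear predictor without the offset: x_i' beta
xb <- as.vector(model.matrix(fit) %*% coef(fit))

# Fitted means rebuilt by hand: mu_i = n_i * exp(x_i' beta)
mu_by_hand <- df$n * exp(xb)

# Compare with the fitted values reported by glm()
print(cbind(mu_by_hand, fitted = as.vector(fitted(fit))))
```

The two columns should agree, which shows that the offset only rescales the expected count by the exposure and does not add a coefficient to estimate.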

Overdispersion

A characteristic of the Poisson distribution is that its mean is equal to its variance. When the observed variance is greater than the mean, the data are said to be overdispersed. Overdispersion indicates that the Poisson model, as specified, does not fit the data adequately, and the standard errors it reports will tend to be too small.

 

A common reason for overdispersion is the exclusion of relevant explanatory variables.
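
One common way to look for overdispersion is to divide the Pearson chi-square statistic by the residual degrees of freedom; values well above 1 are often read as a sign of overdispersion. The R sketch below applies this check to the insurance data used in the examples in this article and then refits the model with the quasi-Poisson family, which estimates the same dispersion factor. With only six observations the result is purely illustrative, and the object names are assumptions.

```r
# Insurance data from the examples in this article
df <- data.frame(
  n   = c(500, 1200, 100, 400, 500, 300),   # exposure for each group
  c   = c(42, 37, 1, 101, 73, 14),          # observed counts
  car = factor(c("small", "medium", "large", "small", "medium", "large")),
  age = factor(c(1, 1, 1, 2, 2, 2))
)

fit <- glm(c ~ car + age + offset(log(n)),
           family = poisson(link = "log"), data = df)

# Pearson chi-square divided by the residual degrees of freedom
pearson_chisq <- sum(residuals(fit, type = "pearson")^2)
dispersion <- pearson_chisq / df.residual(fit)
print(dispersion)

# Quasi-Poisson refit: same coefficient estimates, with standard errors
# scaled by the square root of the estimated dispersion
fit_quasi <- glm(c ~ car + age + offset(log(n)),
                 family = quasipoisson(link = "log"), data = df)
print(summary(fit_quasi)$dispersion)
```

If the ratio is far above 1, adding omitted explanatory variables or switching to a negative binomial model are the usual next steps to consider.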

 

Example Code in SAS

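/* Insurance claims data: n = number of policyholders, c = number of claims */
data insure;
   input n c car$ age;
   ln = log(n);   /* log of the exposure, used as the offset */
   datalines;
500   42  small  1
1200  37  medium 1
100    1  large  1
400  101  small  2
500   73  medium 2
300   14  large  2
;

/* Poisson regression of claims on car size and age group with a log link */
proc genmod data=insure;
   class car age;
   model c = car age / dist   = poisson
                       link   = log
                       offset = ln;
run;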

 

Example Code in R

 

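# input data: n = number of policyholders (exposure), c = number of claims
n <- c(500, 1200, 100, 400, 500, 300)
c <- c(42, 37, 1, 101, 73, 14)
car <- c("small", "medium", "large", "small", "medium", "large")
age <- c(1, 1, 1, 2, 2, 2)

# identify car and age as categorical variables
car <- factor(car)
age <- factor(age)

# create the data frame
df <- data.frame(n, c, car, age)

# Poisson regression with log(n) as the offset, matching the SAS model.
# R uses the first factor level as the reference and PROC GENMOD uses the last,
# so the coefficients are parameterized differently but the fitted values agree.
fit <- glm(c ~ car + age + offset(log(n)), data = df, family = poisson(link = "log"))
summary(fit)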

 

 

 

 

References

1.     Cameron, A. C.; Trivedi, P. K. (2013). Regression Analysis of Count Data (Second ed.). Cambridge University Press. ISBN 978-1-107-66727-3.

2.     Poisson Regression. (2019, December 13). SAS® Help Center. https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_ttest_syntax01.htm