Binary
Example Codes:
Comparison: SAS #1
Comparison: R #1
Comparison: SAS #2
Comparison: R #2
Prediction: SAS #1
Prediction: R #1
Prediction: SAS #2
Prediction: R #2
Objectives
If your response variable is binary and you want to
analyze the relationship between the binary variable and other variables,
logistic regression can deal with that.
Assume the binary variable, y, only contains two values; 0 and 1.
Here is the logistic regression’s expression:
So, the outcome of logistic regression is the probability of getting the
response variable equal to 1.
The is the odds ratio.
General Example Code in SAS
PROC LOGISTIC DATA=SAS-data-set ;
CLASS variables ;
MODEL response=predictors
;
UNITS independent1=list;
ODDSRATIO <‘label’> variable ;
OUTPUT OUT=SAS-data-set keyword=name ;
RUN;
CLASS names the classification variables to be used in
the analysis. The CLASS statement must precede the MODEL statement. By default,
these variables will be analyzed using effects coding parameterization. This
can be changed with the PARAM= option.
MODEL specifies the response variable and the
predictor variables.
OUTPUT creates an output data set containing all the
variables from the input data set and any requested statistics.
UNITS enables you to obtain an odds ratio estimate for
a specified change in a predictor variable. The unit of change can be a number,
standard deviation (SD), or a number times the standard deviation (for example,
2*SD).
ODDSRATIO produces odds ratios for variables even when
the variables are involved in interactions with other covariates, and for
classification variables that use any parameterization. You can specify several
ODDSRATIO statements.
Outputs
The Model Information table describes the data set,
the response variable, the number of response levels, the type of model, the
algorithm used to obtain the parameter estimates, and the number of observations
read and used.
The Number of Observations Used is the count of all
observations that are nonmissing for all variables
specified in the MODEL statement. The ages of 263 of these 1309 passengers
cannot be determined and cannot be used to estimate the model.
The Response Profile table shows the response variable
values listed according to their ordered values. By default, PROC LOGISTIC
orders the response variable alphanumerically so that it bases the logistic
regression model on the probability of the smallest value. Because you used the
EVENT=option in this example, the model is based on the probability of
surviving (Survived=1). The Response Profile table also shows frequencies of
response values.
The Model Fit Statistics provides three tests
1. AIC is Akaike’s ‘A’ information criterion.
2. SC is the Schwarz
criterion.
3. -2 Log L is -2 times the
natural log of the likelihood.
-2 Log L, AIC, and SC are goodness-of-fit measures
that you can use to compare one model to another. These statistics measure
relative fit among models, but they do not measure absolute fit of any single
model. Smaller values for all of these measures
indicate a better fit.
The Testing Global Null Hypothesis: BETA=0 table
provides three statistics to test the null hypothesis that all regression
coefficients of the model are 0.
The Analysis of Maximum Likelihood Estimates table
lists the estimated model parameters, their standard errors, Wald Chi-Square
values, and p-values.
This information can be expressed in the below equation:
The Wald chi-square and its associated p-value test whether
the parameter estimate is significantly different from 0. For this example, the
p-values for the variable Age are not significant at the 0.05 significance
level (p=0.0696). It cannot be concluded that Age is not important in a
multivariate model. The
odds ratio and confidence interval can be found in the following table:
Multinomial
Objectives
If your response variable is multinomial and you
want to analyze the relationship between the multinomial variable and other
variables, you may use multinomial logistic regression.
Multinomial logistic regression predicts the
probability that the response variable falls into
th
category. If there are
outcomes for the response variable
,
then the whole model can be defined as a set of
binary logistic regression. Suppose you have
observed data
,
and fitted coefficients
. The following equation
represents the
th logistic model for
predicting the probability of the
th outcome. If outcome
(the last outcome) is chosen as the pivot:
Where:
Prediction: Example Code in SAS
data travel;
input AutoTime PlanTime TranTime Age Sex Chosen $;
datalines;
10.0 4.5
10.5 32 1
Plane
5.5 4.0
7.5 13 1
Auto
4.5 6.0
5.5 41 1
Transit
3.5 2.0
5.0 41 1
Transit
1.5 4.5
4.0 47 1
Auto
10.5 3.0
10.5 24 1
Plane
7.0 3.0
9.0 27 1
Auto
9.0 3.5
9.0 21 1
Plane
4.0 5.0
5.5 23 1
Auto
22.0 4.5
22.5 30 0
Plane
7.5 5.5
10.0 58 0
Plane
11.5 3.5
11.5 36 0
Transit
3.5 4.5
4.5 43 0
Auto
12.0 3.0
11.0 33 0
Plane
18.0 5.5
20.0 30 0
Plane
23.0 5.5
21.5 28 0
Plane
4.0 3.0
4.5 44 0
Plane
5.0 2.5
7.0 37 0
Transit
3.5 2.0
7.0 45 0
Auto
12.5 3.5
15.5 35 0 Plane
1.5 4.0
2.0 22 0 Auto
;
proc logistic data = travel;
class chosen;
model chosen = age / link = glogit;
run;
Prediction: Example Code in R
install.packages("nnet")
library(nnet)
travel = data.frame(
AutoTime
= c(10,5.5,4.5,3.5,1.5,10.5,7,9,4,22,7.5,11.5,3.5,12,18,23,4,5,3.5,12.5,1.5),
PlanTime
= c(4.5,4,6,2,4.5,3,3,3.5,5,4.5,5.5,3.5,4.5,3,5.5,5.5,3,2.5,2,3.5,4),
TranTime
= c(10.5,7.5,5.5,5,4,10.5,9,9,5.5,22.5,10,11.5,4.5,11,20,21.5,4.5,7,7,15.5,2),
Age =
c(32,13,41,41,47,24,27,21,23,30,58,36,43,33,30,28,44,37,45,35,22),
Sex =
c(rep("female",10), rep("male",11)),
Chosen =
c("Plane", "Auto", "Transit",
"Transit", "Auto", "Plane", "Auto",
"Plane", "Auto",
"Plane", "Plane", "Transit",
"Auto", "Plane", "Plane", "Plane",
"Plane",
"Transit",
"Auto", "Plane", "Auto")
)
test <- multinom(Chosen ~ Age,
data = travel)
summary(test)
# Predicted probability
head(pp <- fitted(test))
Prediction: Example Code in SAS
data titanic;
set "C:\path\titanic"; /* insert correct path to
file here */
run;
proc logistic data=titanic alpha=.05 plots(only)=(effect oddsratio);
model Survived(event='1')=Age / clodds=pl;
run;
Prediction: Example Code in R
# Loading data
mydata <-
read.csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
## view the first few rows of the data
head(mydata)
# Train test split
mydata$Survived <- factor(mydata$Survived)
train = mydata[1:700,]
test = mydata[701:887,]
# Fitting model
mylogit <- glm(Survived ~ Age, data =
train, family = "binomial")
summary(mylogit)
# Prediction
pred_data = predict(mylogit, newdata = test, type =
"response")
## Accuracy:
sum(ifelse(pred_data > 0.5, 1, 0) == test$Survived)/length(test$Survived)
Comparison: Example
Code in SAS
The data needed in this code can be
downloaded here
data titanic;
set "C:\path\titanic"; /* insert correct path to
file here */
run;
proc logistic data=titanic alpha=.05 plots(only)=(effect oddsratio);
class female / param=reference;
model Survived(event='1')=Age
female / clodds=pl;
run;
Comparison: Example Code in R
# Loading data
mydata <-
read.csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
## view the first few rows of the data
head(mydata)
# Fitting model
mydata$Survived <-
factor(mydata$Survived)
mylogit <- glm(Survived
~ Age + Sex, data = mydata, family =
"binomial")
summary(mylogit)
Comparison: Example Code in SAS
data travel;
input AutoTime PlanTime TranTime Age Sex Chosen $;
datalines;
10.0 4.5
10.5 32 1
Plane
5.5 4.0
7.5 13 1
Auto
4.5 6.0
5.5 41
1 Transit
3.5 2.0
5.0 41 1
Transit
1.5 4.5
4.0 47 1
Auto
10.5 3.0
10.5 24 1
Plane
7.0 3.0
9.0 27 1
Auto
9.0 3.5
9.0 21 1
Plane
4.0 5.0
5.5 23 1 Auto
22.0 4.5
22.5 30 0
Plane
7.5 5.5
10.0 58 0
Plane
11.5 3.5
11.5 36 0
Transit
3.5 4.5
4.5 43 0
Auto
12.0 3.0
11.0 33 0
Plane
18.0 5.5
20.0 30 0 Plane
23.0 5.5
21.5 28 0
Plane
4.0 3.0
4.5 44 0
Plane
5.0 2.5
7.0 37 0
Transit
3.5 2.0
7.0 45 0
Auto
12.5 3.5
15.5 35 0 Plane
1.5 4.0
2.0 22 0 Auto
;
proc logistic data = travel;
model chosen = age Sex / link = glogit;
run;
Comparison: Example Code in R
install.packages("nnet")
library(nnet)
travel = data.frame(
AutoTime
= c(10,5.5,4.5,3.5,1.5,10.5,7,9,4,22,7.5,11.5,3.5,12,18,23,4,5,3.5,12.5,1.5),
PlanTime
= c(4.5,4,6,2,4.5,3,3,3.5,5,4.5,5.5,3.5,4.5,3,5.5,5.5,3,2.5,2,3.5,4),
TranTime
= c(10.5,7.5,5.5,5,4,10.5,9,9,5.5,22.5,10,11.5,4.5,11,20,21.5,4.5,7,7,15.5,2),
Age = c(32,13,41,41,47,24,27,21,23,30,58,36,43,33,30,28,44,37,45,35,22),
Sex =
c(rep("female",10), rep("male",11)),
Chosen = c("Plane", "Auto", "Transit",
"Transit", "Auto", "Plane", "Auto",
"Plane", "Auto",
"Plane", "Plane", "Transit",
"Auto", "Plane", "Plane", "Plane",
"Plane",
"Transit", "Auto", "Plane",
"Auto")
)
test <- multinom(Chosen ~ Age + Sex, data = travel)
summary(test)