R scripts for the lecture course
Machine Learning, pattern recognition and statistical data modelling
Coryn A.L. Bailer-Jones, 2007

Linear Methods Part 2
---------------------

***** LDA

Model the class distributions as multivariate Gaussians. LDA is the special case in which we assume that the classes have equal covariance matrices (Hastie et al. p. 86). If we drop this assumption, we have QDA.

Let there be g classes in a p-dimensional data space. LDA determines the mean of each class and uses the proximity of a new data point to these means to make a classification (taking into account the covariance matrix and the prior probabilities; if we sphere the data and have equal priors, we just take the nearest centroid). Further, the g centroids in the p-dimensional data space lie in an affine subspace of dimension at most g-1. In locating the nearest centroid, directions orthogonal to this subspace do not discriminate between the classes (within the LDA assumptions), so we may as well project the data onto this subspace. Thus LDA performs a dimension reduction, p -> g-1 (when p > g-1), i.e. for g classes there are g-1 LDs. The LDs are mutually orthogonal vectors in the (sphered) data space which maximally discriminate between the classes. One can also show that they maximise the between-class variance relative to the within-class variance (MASS4 section 12.1). The LDs are ordered in terms of separation significance.

library(MASS)
data(iris3)
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
ir.lda <- lda(Sp ~ ., Iris)

For the data frame Iris, perform an LDA in which the groups are given by the named column Sp and the explanatory variables are everything else (this is the meaning of the "." on the RHS of the formula; at least, we get the same result if we use "Sepal.L. + Sepal.W. + Petal.L. + Petal.W." on the RHS). To predict using this we use

ir.ld <- predict(ir.lda)

which, as "newdata" is not specified, predicts using the original data set. To plot:

eqscplot(ir.ld$x, type="n", xlab="LD1", ylab="LD2")  # plot window
# plot the data as characters; unclass() gives the integer class codes
text(ir.ld$x, labels=as.character(Iris$Sp), col=3 + unclass(Iris$Sp), cex=0.8)

Evaluate a contingency table of the classification results:

t(table(Iris$Sp, ir.ld$class))  # transpose so that a true class is a column

In LDA the class boundaries are linear hyperplanes: for g classes there are g(g-1)/2 pairwise boundaries. The separating hyperplanes (boundaries) between the groups are NOT simply the perpendicular bisectors of the lines joining the class centroids. That is only the case when the covariance matrix is spherical (= identity matrix times a constant) and the priors are equal. (The former condition could be satisfied by plotting the sphered data.) Boundaries are plotted for the crabs example in MASS at the bottom of p. 335 using the perp() function. See Figs. 4.1 and 4.5 of Hastie et al. In practice, however (especially for QDA), it may be easier to compute the boundaries exhaustively, i.e. evaluate the decision rule on a finite lattice of points and then use a contouring method to draw the boundaries (as in Hastie et al., footnote on p. 88); a sketch is given at the end of this section.

# Repeat but with separate train and test sets
train <- sample(1:150, 100)
ir.lda <- lda(Sp ~ ., Iris, subset=train)
ir.ld <- predict(ir.lda)
ir.ld.test <- predict(ir.lda, Iris[-train, ])$class
t(table(Iris$Sp[train], ir.ld$class))  # results on training set
t(table(Iris$Sp[-train], ir.ld.test))  # results on test set
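Below is a minimal sketch (not part of the original script) of this lattice-and-contour approach for the Iris data, restricted to the two sepal variables so that the data space is two-dimensional and can be plotted directly; the grid resolution of 100 points per axis is an arbitrary choice.

# Sketch: LDA decision boundaries computed on a lattice of points and contoured
ir2.lda <- lda(Sp ~ Sepal.L. + Sepal.W., Iris)
xp <- seq(min(Iris$Sepal.L.), max(Iris$Sepal.L.), length=100)
yp <- seq(min(Iris$Sepal.W.), max(Iris$Sepal.W.), length=100)
gridpts <- expand.grid(Sepal.L.=xp, Sepal.W.=yp)
zp <- as.numeric(predict(ir2.lda, gridpts)$class)  # predicted class (1,2,3) at each lattice point
eqscplot(Iris$Sepal.L., Iris$Sepal.W., pch=20, col=3 + unclass(Iris$Sp),
         xlab="Sepal.L.", ylab="Sepal.W.")
contour(xp, yp, matrix(zp, length(xp), length(yp)), levels=c(1.5, 2.5),
        add=TRUE, drawlabels=FALSE)                # boundaries lie where the predicted class changes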
***** Logistic regression

LDA made a direct assumption about the form of the likelihood. We relax this, and now say that we want to model the posterior probabilities via linear functions of x, while of course requiring that the probabilities remain in [0,1] and sum to one.

# Apply the R function polr (proportional odds logistic regression) to the Fisher iris data set
library(MASS)
data(iris3)
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
ir.lr <- polr(Sp ~ ., Iris, method="logistic")
ir.lr
predict(ir.lr)
t(table(Iris$Sp, predict(ir.lr)))

The results are not that good; we would probably need to tune the parameters of the optimizer, optim().

***** Logistic regression with the Challenger data set

challenger <- read.table('challenger.dat', header=TRUE)
attach(challenger)
challenger

# plot all the data
plot(temperature, r, xlim=c(20,90), ylim=c(0,6),
     xlab='Joint temperature / F', ylab='No. of O-ring failures')
abline(v=31, col='red')

How can we set up a model to predict the probability of O-ring failure at t=31 F? To fit a logistic regression model we use a Generalized Linear Model, which uses maximum likelihood to find the model parameters. Section 7.2 of MASS explains how to use the glm() function for binomial data. One way to use it is for the response to be a two-column matrix with the first column the number of successes and the second column the number of failures. Note that we are directly modelling the probability, p, of an O-ring failure, and this is equal to the expected proportion of O-ring failures, p = E(r)/m, because r follows a binomial distribution with mean mp.

# First check the dependence on both variables
# (use jitter to avoid overplotting points)
plot(jitter(pressure, factor=0.5), r, xlim=c(40,210), ylim=c(0,6),
     xlab='Pressure', ylab='No. of O-ring failures')
fit0.glm <- glm(cbind(r,m-r) ~ temperature + pressure, data=challenger, family=binomial)
summary(fit0.glm)
# It doesn't look like there is any significant pressure dependence

# Just fit the temperature dependence. Replot the data against temperature
plot(temperature, r, xlim=c(20,90), ylim=c(0,6),
     xlab='Joint temperature / F', ylab='No. of O-ring failures')
abline(v=31, col='red')
# Fit
fit.glm <- glm(cbind(r,m-r) ~ temperature, data=challenger, family=binomial)
# anova(fit.glm)
summary(fit.glm)

# predict the probability of failure of a single O-ring joint at the following temperatures
testtemp <- seq(10,100,1)
pred.glm <- predict(fit.glm, data.frame(temperature=testtemp), type="response")
lines(testtemp, 6*pred.glm)

The fundamental issue discussed in the pre-launch telecon between NASA and the SRB makers Morton Thiokol was whether the probability of O-ring failure depended on temperature. They looked only at the data from previous launches in which there were failures and concluded that there was no dependence. But if we include the launches with no failures, we see evidence for a dependence.

# plot only the launches with at least one failure
failures <- which(r!=0)
plot(temperature[failures], r[failures], xlim=c(30,90), ylim=c(0,6),
     xlab='Joint temperature / F', ylab='No. O-ring failures', col='blue')
abline(v=31, col='red')
# Fit only to the data with failures
fit2.glm <- glm(cbind(r,m-r) ~ temperature, data=challenger, subset=which(r!=0), family=binomial)
lines(testtemp, 6*predict(fit2.glm, data.frame(temperature=testtemp), type="response"))
# overplot the launches without failures
points(temperature[-failures], r[-failures])
# Replot the fit to all the data for comparison
lines(testtemp, 6*pred.glm)
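As a sanity check (not in the original script) we can reproduce the glm() predictions by hand: with family=binomial the default link is the logit, so the fitted probability is the inverse logit of the linear predictor.

# Check: evaluate the fitted model by hand, p(t) = 1/(1 + exp(-(b0 + b1*t))),
# which should agree with predict(fit.glm, type="response")
b <- coef(fit.glm)
p31 <- 1/(1 + exp(-(b[1] + b[2]*31)))  # failure probability of a single O-ring at t=31 F
p31
6*p31                                  # expected number of O-ring failures out of m=6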
What is causing the predicted probability of failure to rise at low temperatures? Is it the point with two failures at t=53 F? In other words, how sensitive is the model to the data?

# Fit again with the point at t=53 F removed
fit3.glm <- glm(cbind(r,m-r) ~ temperature, data=challenger, subset=which(temperature!=53), family=binomial)
lines(testtemp, 6*predict(fit3.glm, data.frame(temperature=testtemp), type="response"))

Thus the point at t=53 F does not have a big effect. So how is the model making its predictions at low temperatures? Rather than being driven by that single point, the form of the regression model means that the large number of non-failures at high temperatures combined with the few failures at low temperatures implies an increasing probability of failure at low temperatures.

The full model predicts p(t=31 F) = 0.82, i.e. 6*0.82 = 4.9 O-ring failures (this is obvious but is also the mean of the binomial distribution, m*p). Note that there is a very large uncertainty in the logistic regression extrapolation at t=31 F: a bootstrap 90% confidence interval is (1/6, 1), from Dalal et al.

# The probability of r failures from m (=6) O-rings, each failing with probability p,
# is given by the binomial distribution dbinom(r, m, p)
# To see all the probabilities do
dbinom(x=c(0,1,2,3,4,5,6), size=6, prob=0.82)
# Thus the probability that at least one O-ring fails is
1 - dbinom(x=0, size=6, prob=0.82)
# = sum(dbinom(x=c(1,2,3,4,5,6), size=6, prob=0.82))

Is the logistic regression a reasonable model? What about the problem of extrapolation? Can we answer these questions just using statistics? What basic assumptions have been made with the logistic regression model of Dalal et al.? Is the model appropriate?

1. We are assuming that the logistic model holds over the whole (extrapolated) temperature range.

2. The data suggest that there is a temperature dependence between 53 and 81 F, but they don't say whether there is an effect below 53 F or whether it is smooth. The quality of the fit between 53 and 81 F doesn't tell us whether the extrapolation is appropriate.

3. Various other models give similar fits (see the paper by Lavine), but we cannot use statistics to decide which is appropriate. For this we need engineering knowledge. (Without some external knowledge we cannot extrapolate at all!) Extrapolation is risky, but when asked for a launch decision you have little choice.

4. Our interpretation of the model results (but not the model itself) is that the higher the probability of O-ring failure, the more likely is the catastrophic loss of the Shuttle. Note that we are not saying that the loss of all 6 O-rings implies definite loss of the Shuttle. In fact, loss of all 6 is not that likely according to our model: Pr(R=6) = 0.30 (see the check at the end of this section). Physically, one or more O-ring failures could destroy the Shuttle, but loss of an O-ring also does not mean the Shuttle will explode (several successful flights showed one or two O-ring failures).

All we are modelling is the dependence of the probability of O-ring failure on temperature. We cannot use these data alone to model the probability of a launch failure (as there had not been any so far). So this analysis does not actually answer the real question of interest, namely "What is the probability of a catastrophe at a launch temperature of t=31 F?"
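Two quick numerical checks of figures quoted above (not in the original script): the binomial probability in point 4, and a rough sketch of the kind of resampling calculation behind the quoted confidence interval, here assumed to be a simple nonparametric bootstrap over launches (Dalal et al.'s scheme differs in detail).

# (i) Point 4: with p = 0.82 for a single O-ring at t=31 F,
dbinom(x=6, size=6, prob=0.82)    # Pr(all six O-rings fail) = 0.82^6, approx 0.30
# (ii) Sketch of a bootstrap 90% interval for p(t=31 F); unlucky resamples
# (e.g. with very few failures) may produce glm() warnings
set.seed(1)
p31.boot <- replicate(1000, {
  idx <- sample(nrow(challenger), replace=TRUE)
  fit.b <- glm(cbind(r, m-r) ~ temperature, data=challenger[idx, ], family=binomial)
  predict(fit.b, data.frame(temperature=31), type="response")
})
quantile(p31.boot, c(0.05, 0.95))  # compare with the (1/6, 1) interval quoted above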