R scripts for the lecture course
Machine Learning, pattern recognition and statistical data modelling
Coryn A.L. Bailer-Jones, 2007

Linear Methods Part 2
---------------------

***** LDA

Model the class distributions as multivariate Gaussians. LDA is the special case in which we assume that the classes have equal covariance matrices (Hastie et al. p. 86). If we drop this assumption, we have QDA.

Let there be g classes in a p-dimensional data space. LDA determines the mean of each class and uses the proximity of a new data point to these means to make a classification (taking into account the covariance matrix and the prior probabilities; if we sphere the data and have equal priors, we just take the nearest centroid). Further, the g centroids in the p-dimensional data space lie in an affine subspace of dimension at most g-1. In locating the nearest centroid, directions orthogonal to this subspace do not discriminate between the classes (within the LDA assumptions), so we may as well project the data onto this subspace. Thus LDA performs a dimension reduction, p -> g-1 (when p > g-1), i.e. for g classes there are g-1 LDs. The LDs are mutually orthogonal vectors in the (sphered) data space which maximally discriminate between the classes. One can also show that they maximise the between-class variance relative to the within-class variance (MASS4 section 12.1). The LDs are ordered in terms of separation significance.

library(MASS)
data(iris3)
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
ir.lda <- lda(Sp ~ ., Iris)

For the data frame Iris, perform an LDA in which the groups are given by the named column Sp and the explanatory variables are everything else (this is the meaning of the "." on the RHS of the formula; at least, we get the same result if we use "Sepal.L. + Sepal.W. + Petal.L. + Petal.W." on the RHS). To predict using this we use

ir.ld <- predict(ir.lda)

which, as "newdata" is not specified, predicts using the original data set. To plot:

eqscplot(ir.ld$x, type="n", xlab="LD1", ylab="LD2")  # plot window
# plot the data as characters; unclass() gives the integer class codes
text(ir.ld$x, labels=as.character(Iris$Sp), col=3 + unclass(Iris$Sp), cex=0.8)

Evaluate a contingency table of the classification results:

t(table(Iris$Sp, ir.ld$class))  # transpose so that a true class is a column

In LDA the class boundaries are linear hyperplanes: for g classes there are g(g-1)/2 pairwise boundaries. The separating hyperplanes (boundaries) between the groups are NOT simply the perpendicular bisectors of the lines joining the class centroids. That is only the case when the covariance matrix is spherical (= identity matrix times a constant) and the priors are equal. (The former condition could be satisfied by plotting the sphered data.) Boundaries are plotted for the crabs example in MASS at the bottom of p. 335 using the perp() function. See Figs. 4.1 and 4.5 of Hastie et al. In practice, however (especially for QDA), it may be easier to compute the boundaries exhaustively, i.e. evaluate the decision rule on a finite lattice of points and then use a contouring method to draw the boundaries (as in Hastie et al., footnote on p. 88); a sketch is given at the end of this section.

# Repeat but with separate train and test sets
train <- sample(1:150, 100)
ir.lda <- lda(Sp ~ ., Iris, subset=train)
ir.ld <- predict(ir.lda)
ir.ld.test <- predict(ir.lda, Iris[-train, ])$class
t(table(Iris$Sp[train], ir.ld$class))  # results on training set
t(table(Iris$Sp[-train], ir.ld.test))  # results on test set
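Below is a minimal sketch (not part of the original script) of this lattice-and-contour approach for the Iris data, restricted to the two sepal variables so that the data space is two-dimensional and can be plotted directly; the grid resolution of 100 points per axis is an arbitrary choice.

# Sketch: LDA decision boundaries computed on a lattice of points and contoured
ir2.lda <- lda(Sp ~ Sepal.L. + Sepal.W., Iris)
xp <- seq(min(Iris$Sepal.L.), max(Iris$Sepal.L.), length=100)
yp <- seq(min(Iris$Sepal.W.), max(Iris$Sepal.W.), length=100)
gridpts <- expand.grid(Sepal.L.=xp, Sepal.W.=yp)
zp <- as.numeric(predict(ir2.lda, gridpts)$class)  # predicted class (1,2,3) at each lattice point
eqscplot(Iris$Sepal.L., Iris$Sepal.W., pch=20, col=3 + unclass(Iris$Sp),
         xlab="Sepal.L.", ylab="Sepal.W.")
contour(xp, yp, matrix(zp, length(xp), length(yp)), levels=c(1.5, 2.5),
        add=TRUE, drawlabels=FALSE)                # boundaries lie where the predicted class changes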
***** Logistic regression

LDA made a direct assumption about the form of the likelihood. We relax this, and now say that we want to model the posterior probabilities via linear functions of x, while of course requiring that the probabilities remain in [0,1] and sum to one.

# Apply the R function polr (proportional odds logistic regression) to the Fisher iris data set
library(MASS)
data(iris3)
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
ir.lr <- polr(Sp ~ ., Iris, method="logistic")
ir.lr
predict(ir.lr)
t(table(Iris$Sp, predict(ir.lr)))

The results are not that good; we would probably need to tune the parameters of the optimizer, optim().

***** Logistic regression with the Challenger data set

challenger <- read.table('challenger.dat', header=TRUE)
attach(challenger)
challenger

# plot all the data
plot(temperature, r, xlim=c(20,90), ylim=c(0,6),
     xlab='Joint temperature / F', ylab='No. of O-ring failures')
abline(v=31, col='red')

How can we set up a model to predict the probability of O-ring failure at t=31 F? To fit a logistic regression model we use a Generalized Linear Model, which uses maximum likelihood to find the model parameters. Section 7.2 of MASS explains how to use the glm() function for binomial data. One way to use it is for the response to be a two-column matrix with the first column the number of successes and the second column the number of failures. Note that we are directly modelling the probability, p, of an O-ring failure, and this is equal to the expected proportion of O-ring failures, p = E(r)/m, because r follows a binomial distribution with mean mp.

# First check the dependence on both variables
# (use jitter to avoid overplotting points)
plot(jitter(pressure, factor=0.5), r, xlim=c(40,210), ylim=c(0,6),
     xlab='Pressure', ylab='No. of O-ring failures')
fit0.glm <- glm(cbind(r,m-r) ~ temperature + pressure, data=challenger, family=binomial)
summary(fit0.glm)
# It doesn't look like there is any significant pressure dependence

# Just fit the temperature dependence. Replot the data against temperature
plot(temperature, r, xlim=c(20,90), ylim=c(0,6),
     xlab='Joint temperature / F', ylab='No. of O-ring failures')
abline(v=31, col='red')
# Fit
fit.glm <- glm(cbind(r,m-r) ~ temperature, data=challenger, family=binomial)
# anova(fit.glm)
summary(fit.glm)

# predict the probability of failure of a single O-ring joint at the following temperatures
testtemp <- seq(10,100,1)
pred.glm <- predict(fit.glm, data.frame(temperature=testtemp), type="response")
lines(testtemp, 6*pred.glm)

The fundamental issue discussed in the pre-launch telecon between NASA and the SRB makers Morton Thiokol was whether the probability of O-ring failure depended on temperature. They looked only at the data from previous launches in which there were failures and concluded that there was no dependence. But if we include the launches with no failures, we see evidence for a dependence.

# plot only the launches with at least one failure
failures <- which(r!=0)
plot(temperature[failures], r[failures], xlim=c(30,90), ylim=c(0,6),
     xlab='Joint temperature / F', ylab='No. O-ring failures', col='blue')
abline(v=31, col='red')
# Fit only to the data with failures
fit2.glm <- glm(cbind(r,m-r) ~ temperature, data=challenger, subset=which(r!=0), family=binomial)
lines(testtemp, 6*predict(fit2.glm, data.frame(temperature=testtemp), type="response"))
# overplot the launches without failures
points(temperature[-failures], r[-failures])
# Replot the fit to all the data for comparison
lines(testtemp, 6*pred.glm)
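As a sanity check (not in the original script) we can reproduce the glm() predictions by hand: with family=binomial the default link is the logit, so the fitted probability is the inverse logit of the linear predictor.

# Check: evaluate the fitted model by hand, p(t) = 1/(1 + exp(-(b0 + b1*t))),
# which should agree with predict(fit.glm, type="response")
b <- coef(fit.glm)
p31 <- 1/(1 + exp(-(b[1] + b[2]*31)))  # failure probability of a single O-ring at t=31 F
p31
6*p31                                  # expected number of O-ring failures out of m=6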
What is causing the predicted probability of failure to rise at low temperatures? Is it the point with two failures at t=53 F? In other words, how sensitive is the model to the data?

# Fit again with the point at t=53 F removed
fit3.glm <- glm(cbind(r,m-r) ~ temperature, data=challenger, subset=which(temperature!=53), family=binomial)
lines(testtemp, 6*predict(fit3.glm, data.frame(temperature=testtemp), type="response"))

Thus the point at t=53 F does not have a big effect. So how is the model making its predictions at low temperatures? Rather than being driven by that single point, the form of the regression model means that the large number of non-failures at high temperatures combined with the few failures at low temperatures implies an increasing probability of failure at low temperatures.

The full model predicts p(t=31 F) = 0.82, i.e. 6*0.82 = 4.9 O-ring failures (this is obvious but is also the mean of the binomial distribution, m*p). Note that there is a very large uncertainty in the logistic regression extrapolation at t=31 F: a bootstrap 90% confidence interval is (1/6, 1), from Dalal et al.

# The probability of r failures from m (=6) O-rings, each failing with probability p,
# is given by the binomial distribution dbinom(r, m, p)
# To see all the probabilities do
dbinom(x=c(0,1,2,3,4,5,6), size=6, prob=0.82)
# Thus the probability that at least one O-ring fails is
1 - dbinom(x=0, size=6, prob=0.82)
# = sum(dbinom(x=c(1,2,3,4,5,6), size=6, prob=0.82))

Is the logistic regression a reasonable model? What about the problem of extrapolation? Can we answer these questions just using statistics? What basic assumptions have been made with the logistic regression model of Dalal et al.? Is the model appropriate?

1. We are assuming that the logistic model holds over the whole (extrapolated) temperature range.

2. The data suggest that there is a temperature dependence between 53 and 81 F, but they don't say whether there is an effect below 53 F or whether it is smooth. The quality of the fit between 53 and 81 F doesn't tell us whether the extrapolation is appropriate.

3. Various other models give similar fits (see the paper by Lavine), but we cannot use statistics to decide which is appropriate. For this we need engineering knowledge. (Without some external knowledge we cannot extrapolate at all!) Extrapolation is risky, but when asked for a launch decision you have little choice.

4. Our interpretation of the model results (but not the model itself) is that the higher the probability of O-ring failure, the more likely is the catastrophic loss of the Shuttle. Note that we are not saying that the loss of all 6 O-rings implies definite loss of the Shuttle. In fact, loss of all 6 is not that likely according to our model: Pr(R=6) = 0.30 (see the check at the end of this section). Physically, one or more O-ring failures could destroy the Shuttle, but loss of an O-ring also does not mean the Shuttle will explode (several successful flights showed one or two O-ring failures).

All we are modelling is the dependence of the probability of O-ring failure on temperature. We cannot use these data alone to model the probability of a launch failure (as there had not been any so far). So this analysis does not actually answer the real question of interest, namely "What is the probability of a catastrophe at a launch temperature of t=31 F?"
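Two quick numerical checks of figures quoted above (not in the original script): the binomial probability in point 4, and a rough sketch of the kind of resampling calculation behind the quoted confidence interval, here assumed to be a simple nonparametric bootstrap over launches (Dalal et al.'s scheme differs in detail).

# (i) Point 4: with p = 0.82 for a single O-ring at t=31 F,
dbinom(x=6, size=6, prob=0.82)    # Pr(all six O-rings fail) = 0.82^6, approx 0.30
# (ii) Sketch of a bootstrap 90% interval for p(t=31 F); unlucky resamples
# (e.g. with very few failures) may produce glm() warnings
set.seed(1)
p31.boot <- replicate(1000, {
  idx <- sample(nrow(challenger), replace=TRUE)
  fit.b <- glm(cbind(r, m-r) ~ temperature, data=challenger[idx, ], family=binomial)
  predict(fit.b, data.frame(temperature=31), type="response")
})
quantile(p31.boot, c(0.05, 0.95))  # compare with the (1/6, 1) interval quoted above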