R scripts for the lecture course
Machine Learning, pattern recognition and statistical data modelling
Coryn A.L. Bailer-Jones, 2007

Introduction
------------

*** Example of using R (from MASS section 1.3)

library(MASS)

# simple stuff in R
x <- rnorm(1000)
y <- rnorm(1000)
truehist(x, nbins=25)
plot(x, x+y)
dd <- kde2d(x, x+y)            # kernel density estimate
contour(dd)
points(x, x+y)
image(dd)

# 1D fits
hills                          # MASS4 p. 8
pairs(hills)                   # matrix of scatter plots
attach(hills)
plot(dist, time)
fit1 <- lm(time ~ dist)        # linear regression
abline(fit1, col="red", lwd=2)
summary(fit1)                  # see significance of components of model
fit1.1 <- lqs(time ~ dist)     # robust linear regression
abline(fit1.1, col="green", lwd=2)
detach(hills)

michelson                      # MASS4 p. 10
attach(michelson)
plot(Expt, Speed, xlab="Experiment no.", main="Speed of light data")   # boxplot
plot(Run, Speed, xlab="Run no.", main="Speed of light data")
fm <- aov(Speed ~ Run + Expt)  # Run and Expt are factors
summary(fm)

The result shows that Run is not an important factor.

Box plot description: the central bar is the median (the 50% position in the
data distribution). The lower and upper limits of the box are the lower and
upper quartiles, i.e. the 25% and 75% positions in the data distribution;
their difference is the IQR (interquartile range). The upper whisker shows the
largest observation less than or equal to the upper quartile plus 1.5*IQR. The
lower whisker shows the smallest observation greater than or equal to the
lower quartile minus 1.5*IQR.

Student's t distribution: let x be normally distributed. (x-m)/sigma is not
normally distributed (with zero mean and unit variance) when sigma is
estimated from the data. Rather, it follows a t distribution (Gaussian-like,
but with broader tails); a simulation at the end of this file illustrates
this.

*** Learning, generalization and regularization

# After Bishop ch. 9
x <- seq(0,1,0.05)
h <- function(x){0.5 + 0.4*sin(2*pi*x)}   # define a function
set.seed(10)
h_noisy <- h(x) + rnorm(n=length(x), sd=0.05)
plot(x, h_noisy, pch=16)
x_temp <- seq(0,1,0.01)
lines(x_temp, h(x_temp), lty=2, lwd=2)

# Predict using locpoly
# With locpoly we cannot define a prediction grid: we can only set the number
# of points, which are spread uniformly over the range of x
library(KernSmooth)
locpoly(x, h_noisy, bandwidth=0.05)
lines(locpoly(x, h_noisy, bandwidth=0.05), col='blue', lwd=2)
# Repeat with bandwidth values:
#   0.15 -> too smooth
#   0.02 -> overfit

# Predict using ksmooth
library(stats)   # loaded by default in any case
# replot!
ksmooth(x, h_noisy, x.points=x, bandwidth=0.2, kernel='normal')
lines(ksmooth(x, h_noisy, x.points=x, bandwidth=0.2, kernel='normal'), col='red', lwd=2)
# Repeat with bandwidth values:
#   0.5  -> too smooth
#   0.1  -> okay
#   0.05 -> overfit

# Define a test set
xtest <- seq(0.025,1,0.05)
set.seed(20)
htest_noisy <- h(xtest) + rnorm(n=length(xtest), sd=0.05)
# replot the 'training' data first!
points(xtest, htest_noisy, col='blue', pch=17)

# predict at the test data points
ksmooth(x, h_noisy, x.points=xtest, bandwidth=0.2, kernel='normal')
resid <- ksmooth(x, h_noisy, x.points=xtest, bandwidth=0.2, kernel='normal')$y - htest_noisy

# define an rmse function
rmse <- function(x){ sqrt( (1/length(x))*sum(x^2) ) }
rmse(resid)

# training function error
rmse( ksmooth(x, h_noisy, x.points=x, bandwidth=0.05, kernel='normal')$y - h_noisy )
# test function error
rmse( ksmooth(x, h_noisy, x.points=xtest, bandwidth=0.05, kernel='normal')$y - htest_noisy )
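To see the over/underfitting behaviour more systematically, here is a minimal
sketch (the bandwidth grid below is an arbitrary choice) which tabulates the
training and test RMSE over a range of ksmooth bandwidths. The training error
keeps falling as the bandwidth shrinks, while the test error passes through a
minimum at an intermediate bandwidth.

# Sketch: train and test RMSE as a function of ksmooth bandwidth
bws <- c(0.5, 0.2, 0.1, 0.05, 0.02)   # arbitrary grid of bandwidths
err <- t(sapply(bws, function(bw) {
  c(bandwidth = bw,
    train = rmse( ksmooth(x, h_noisy, x.points=x,     bandwidth=bw, kernel='normal')$y - h_noisy ),
    test  = rmse( ksmooth(x, h_noisy, x.points=xtest, bandwidth=bw, kernel='normal')$y - htest_noisy ))
}))
err   # training error falls monotonically; test error has a minimum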
Note that this application of locpoly uses a linear fit over the bandwidth
region about each point, yet appears to produce a nonlinear function.
Actually, it doesn't produce a function at all: it just makes point estimates
at 401 points spread uniformly over the x-axis (this number is fixed by the
gridsize parameter). That is, at each of these points it does a linear fit to
the data lying within a box of size +/- bandwidth/2, and this gives the
predicted y-value.

# With a neural network. Stubbornly refuses to overfit
library(nnet)
predict(nnet(x, h_noisy, size=5, decay=0, maxit=1000, linout=TRUE), as.matrix(x))

A better fit would be obtained using a sine function, of course, or at least
something which is periodic. Domain knowledge helps, but only if it is
correct!
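For example, here is a minimal sketch which fits a sinusoid directly by
nonlinear least squares with nls (the parametrization and the start values
below are rough guesses):

# Sketch: fit a sinusoid directly by nonlinear least squares
fit_sine <- nls(h_noisy ~ a + b*sin(2*pi*x + phi),
                start=list(a=0.5, b=0.3, phi=0.1))   # start values are guesses
summary(fit_sine)   # estimates should be close to a=0.5, b=0.4, phi=0
# overplot on the training data (replot first if necessary)
lines(x_temp, predict(fit_sine, newdata=data.frame(x=x_temp)), col='magenta', lwd=2)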
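Finally, returning to the Student's t distribution mentioned in the
introduction: a minimal simulation sketch (the sample size n=5 and the bin
width are arbitrary choices) showing that the sample mean standardized by the
*estimated* sigma has broader tails than N(0,1), and instead follows a t
distribution with n-1 degrees of freedom.

# Sketch: the t statistic has broader tails than N(0,1)
set.seed(30)
n <- 5
# standardize the sample mean by the estimated standard deviation
tstat <- replicate(10000, { s <- rnorm(n); mean(s)/(sd(s)/sqrt(n)) })
truehist(tstat, h=0.25, xlim=c(-6,6))
curve(dnorm(x), add=TRUE, col='red', lwd=2)         # too narrow in the tails
curve(dt(x, df=n-1), add=TRUE, col='blue', lwd=2)   # matches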