R scripts for the lecture course
Machine Learning, pattern recognition and statistical data modelling
Coryn A.L. Bailer-Jones, 2007

Introduction
------------

*** Example of using R (from MASS section 1.3)

library(MASS)

# simple stuff in R
x <- rnorm(1000)
y <- rnorm(1000)
truehist(x, nbins=25)
plot(x, x+y)
dd <- kde2d(x, x+y)            # kernel density estimate
contour(dd)
points(x, x+y)
image(dd)

# 1D fits
hills                          # MASS4 p. 8
pairs(hills)                   # matrix of scatter plots
attach(hills)
plot(dist, time)
fit1 <- lm(time ~ dist)        # linear regression
abline(fit1, col="red", lwd=2)
summary(fit1)                  # see significance of components of model
fit1.1 <- lqs(time ~ dist)     # robust linear regression
abline(fit1.1, col="green", lwd=2)
detach(hills)

michelson                      # MASS4 p. 10
attach(michelson)
plot(Expt, Speed, xlab="Experiment no.", main="Speed of light data")   # boxplot
plot(Run, Speed, xlab="Run no.", main="Speed of light data")
fm <- aov(Speed ~ Run + Expt)  # Run and Expt are factors
summary(fm)

The result shows that Run is not an important factor.

Box plot description: the central bar is the median (the 50% position in the
data distribution). The lower and upper limits of the box are the lower and
upper quartiles, i.e. the 25% and 75% positions in the data distribution;
their difference is the IQR (interquartile range). The upper whisker shows the
largest observation less than or equal to the upper quartile plus 1.5*IQR. The
lower whisker shows the smallest observation greater than or equal to the
lower quartile minus 1.5*IQR.

Student's t distribution: let x be normally distributed. (x-m)/sigma is not
normally distributed (with zero mean and unit variance) when sigma is
estimated from the data. Rather, it follows a t distribution (Gaussian-like,
but with broader tails); a simulation at the end of this file illustrates
this.

*** Learning, generalization and regularization

# After Bishop ch. 9
x <- seq(0,1,0.05)
h <- function(x){0.5 + 0.4*sin(2*pi*x)}   # define a function
set.seed(10)
h_noisy <- h(x) + rnorm(n=length(x), sd=0.05)
plot(x, h_noisy, pch=16)
x_temp <- seq(0,1,0.01)
lines(x_temp, h(x_temp), lty=2, lwd=2)

# Predict using locpoly
# With locpoly we cannot define a prediction grid: we can only set the number
# of points, which are spread uniformly over the range of x
library(KernSmooth)
locpoly(x, h_noisy, bandwidth=0.05)
lines(locpoly(x, h_noisy, bandwidth=0.05), col='blue', lwd=2)
# Repeat with bandwidth values:
#   0.15 -> too smooth
#   0.02 -> overfit

# Predict using ksmooth
library(stats)   # loaded by default in any case
# replot!
ksmooth(x, h_noisy, x.points=x, bandwidth=0.2, kernel='normal')
lines(ksmooth(x, h_noisy, x.points=x, bandwidth=0.2, kernel='normal'), col='red', lwd=2)
# Repeat with bandwidth values:
#   0.5  -> too smooth
#   0.1  -> okay
#   0.05 -> overfit

# Define a test set
xtest <- seq(0.025,1,0.05)
set.seed(20)
htest_noisy <- h(xtest) + rnorm(n=length(xtest), sd=0.05)
# replot the 'training' data first!
points(xtest, htest_noisy, col='blue', pch=17)

# predict at the test data points
ksmooth(x, h_noisy, x.points=xtest, bandwidth=0.2, kernel='normal')
resid <- ksmooth(x, h_noisy, x.points=xtest, bandwidth=0.2, kernel='normal')$y - htest_noisy

# define an rmse function
rmse <- function(x){ sqrt( (1/length(x))*sum(x^2) ) }
rmse(resid)

# training function error
rmse( ksmooth(x, h_noisy, x.points=x, bandwidth=0.05, kernel='normal')$y - h_noisy )
# test function error
rmse( ksmooth(x, h_noisy, x.points=xtest, bandwidth=0.05, kernel='normal')$y - htest_noisy )
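To see the over/underfitting behaviour more systematically, here is a minimal
sketch (the bandwidth grid below is an arbitrary choice) which tabulates the
training and test RMSE over a range of ksmooth bandwidths. The training error
keeps falling as the bandwidth shrinks, while the test error passes through a
minimum at an intermediate bandwidth.

# Sketch: train and test RMSE as a function of ksmooth bandwidth
bws <- c(0.5, 0.2, 0.1, 0.05, 0.02)   # arbitrary grid of bandwidths
err <- t(sapply(bws, function(bw) {
  c(bandwidth = bw,
    train = rmse( ksmooth(x, h_noisy, x.points=x,     bandwidth=bw, kernel='normal')$y - h_noisy ),
    test  = rmse( ksmooth(x, h_noisy, x.points=xtest, bandwidth=bw, kernel='normal')$y - htest_noisy ))
}))
err   # training error falls monotonically; test error has a minimum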
Note that this application of locpoly uses a linear fit over the bandwidth
region about each point, yet appears to produce a nonlinear function.
Actually, it doesn't produce a function at all: it just makes point estimates
at 401 points spread uniformly over the x-axis (this number is fixed by the
gridsize parameter). That is, at each of these points it does a linear fit to
the data lying within a box of size +/- bandwidth/2, and this gives the
predicted y-value.

# With a neural network. Stubbornly refuses to overfit
library(nnet)
predict(nnet(x, h_noisy, size=5, decay=0, maxit=1000, linout=TRUE), as.matrix(x))

A better fit would be obtained using a sine function, of course, or at least
something which is periodic. Domain knowledge helps, but only if it is
correct!
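For example, here is a minimal sketch which fits a sinusoid directly by
nonlinear least squares with nls (the parametrization and the start values
below are rough guesses):

# Sketch: fit a sinusoid directly by nonlinear least squares
fit_sine <- nls(h_noisy ~ a + b*sin(2*pi*x + phi),
                start=list(a=0.5, b=0.3, phi=0.1))   # start values are guesses
summary(fit_sine)   # estimates should be close to a=0.5, b=0.4, phi=0
# overplot on the training data (replot first if necessary)
lines(x_temp, predict(fit_sine, newdata=data.frame(x=x_temp)), col='magenta', lwd=2)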
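Finally, returning to the Student's t distribution mentioned in the
introduction: a minimal simulation sketch (the sample size n=5 and the bin
width are arbitrary choices) showing that the sample mean standardized by the
*estimated* sigma has broader tails than N(0,1), and instead follows a t
distribution with n-1 degrees of freedom.

# Sketch: the t statistic has broader tails than N(0,1)
set.seed(30)
n <- 5
# standardize the sample mean by the estimated standard deviation
tstat <- replicate(10000, { s <- rnorm(n); mean(s)/(sd(s)/sqrt(n)) })
truehist(tstat, h=0.25, xlim=c(-6,6))
curve(dnorm(x), add=TRUE, col='red', lwd=2)         # too narrow in the tails
curve(dt(x, df=n-1), add=TRUE, col='blue', lwd=2)   # matches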