Master's Course, SS2015
Faculty of Physics and Astronomy, University of Heidelberg


Computational Statistics and Data Analysis (MVComp2)

Course LSF entry

Lecturer: PD Dr. Coryn Bailer-Jones
Assistants: Dr. Morgan Fouesneau (exercise class and marking), Dr. Dae-Won Kim (exercise marking)

Time and location:
Starting on Tuesday 14 April 2015
Lecture: 09:15 to 11:00 Tuesdays, Im Neuenheimer Feld 227 ("KIP") room HS 2
Exercises: 16:15 to 18:00 Wednesdays, Philosophenweg 12, room CIP

Contents

Quick links
Overview
Prerequisites
Course formalities and registration
Lectures
Exercises and homework
Exam
Textbooks
(Semi-)popular books
Websites

Quick links

Lecture notes
Textbook based on the course
Longer R codes, and data, as used in the lecture notes
Exercises and homework
Exercise groups
Discussion forum (moodle)
Course formalities

Overview

This course will provide an introduction to using statistics and computational methods to analyse data. It comprises one 2-hour lecture per week, plus one 2-hour exercise session per week during which you will put into practice what you have learned in the lectures (on paper and on the computer). There will also be homework assignments. The course counts for 6 credit points, which corresponds to 180 hours, of which no more than 50 will be contact time (lectures, exercises). There will be an exam, which you need to pass to get the credit points. The course will be held in English.

This course will take a pragmatic approach. The focus will be on concepts, understanding problems, and the application of techniques to solving problems, rather than reproducing proofs or teaching you recipes to memorize.

Prerequisites

This course is part of the physics Master's programme, but may also be taken within the Bachelor's programme. You do not need to have attended any specific university-level courses in statistics, but I will assume that you have completed the first few semesters of the physics Bachelor's or another quantitative science degree. Look at what is stated in the Master's or Bachelor's module handbooks for further information. The course will involve programming (you can use any language you like). If you don't yet have any programming experience, we strongly recommend you learn a language, and suggest R or python.

Course formalities and registration

This lecture is open to anyone who is interested.

Summary of course formalities. Registration for the exercises (necessary if you want credit points) is now closed.
Note in particular:

The homework must be handed in, on paper, at the lecture following the exercise class. We recommend you form groups of up to three people for the exercises (one submission per group). Please keep these groups constant throughout the course.

Lectures

There will be 12 lectures on the following dates (the exercise session is on the following day). The topics allocated to the dates may well change.

Cumulative lecture notes will be made available. They will be updated with the latest lecture by the end of the day on which the lecture is held. Corrections to earlier chapters will also be made as necessary.
Link to the most recent version of the lecture notes (PDF)

Lecture date Topic
14 April Introduction and probability basics
21 April Estimation and errors: Describing data
28 April Statistical models and inference
5 May Linear models and regression
12 May Parameter estimation I
19 May Parameter estimation II
26 May Monte Carlo methods for inference
2 June No lecture
9 June Frequentist hypothesis testing
16 June Model comparison
23 June Cross validation, regularization, and basis functions
30 June Kernels and other things
7 July Classification
14 July Study week (no lecture)
21 July Exam week (see below)

Exercises and homework

There is one exercise class for all students. It will mostly deal with the topics covered in the lecture on the previous day. This is an integral part of the lecture course, as you only really learn by doing. For each exercise class there is also a set of homework questions. For full information see the exercise/homework class web site.

In case you decide to use R for the computer-based exercises, you will find R code for most of the examples presented in the lectures embedded in the lecture notes. The longer R codes are available as separate files in this zip file, which also contains some of the data sets used in the lectures. This will be updated weekly after the lecture.
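
To give an idea of what that embedded code looks like, here is a minimal, purely illustrative R sketch (it is not taken from the lecture notes or the zip file): it simulates data from a straight line with Gaussian noise and fits it by least squares with lm(), in the spirit of the lecture on linear models and regression.

  # Purely illustrative sketch, not from the course materials:
  # simulate data from a straight line and fit it by least squares.
  set.seed(42)                                   # reproducible simulation
  x <- seq(0, 10, length.out = 50)               # covariate values
  a_true <- 1.5                                  # true intercept
  b_true <- 0.8                                  # true slope
  y <- a_true + b_true * x + rnorm(length(x), mean = 0, sd = 1)  # noisy data

  fit <- lm(y ~ x)           # ordinary least-squares fit of a straight line
  summary(fit)               # coefficient estimates and standard errors

  plot(x, y, pch = 20, xlab = "x", ylab = "y")   # plot the data
  abline(fit, col = "red")                       # overlay the fitted line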

Registration for the exercise classes is now closed. You are welcome to attend the classes, but if you haven't registered, you won't be able to get the credit points.

There is a discussion forum (moodle) associated with this course. For each lecture/exercise week there is a separate forum. We recommend everyone sign up for this, as it will be used to clarify things about the exercises in particular. Contact me if you still need the registration password.

Exam

  • The exam will be held on Thursday 23 July from 09:30 to 11:30 in the large lecture room ("Grosser Hoersaal") in Philosophenweg 12.
  • To be admitted to the exam you need to get 60% of the marks in the homeworks. As there will be 11 homeworks, each of 100 points maximum, you need to get at least 660 points.
  • Attendance at the exam is mandatory if you want to get the credit points. Missing the exam can only be permitted in extreme circumstances (e.g. relevant medical condition supported by evidence) and if discussed with the lecturer in advance. In these cases (only) you will be permitted to take another exam at a later date.
  • This is a pen-and-paper exam. You are not permitted to bring a calculator, computer, mobile phone, or any other electronic device. You are also not permitted to bring any notes: cheat sheets ("Spickzettel") are not allowed. All you need is a pen and a ruler. All bags need to be left outside or at the front of the room. Bring your student ID with you.
  • Admission to the exam room is from 09:15. Exam scripts may not leave the exam room. Leaving the exam room before you are finished will only be allowed in exceptional circumstances, and you must leave your script with the invigilators.

Textbooks

(update 2017): Practical Bayesian inference: A primer for physical scientists

I recommend that you get hold of an introductory statistics text to use during this course. There are many around, varying in their scope, level, emphasis and quality. The course does not follow a single book; below is a random sample. The course focuses on the use of statistics in the physical sciences, so some recipes that are considered basic in the social sciences will not be covered: you may want to take this into account when getting a book. There are several texts which examine specifically the use of R in statistics, although these tend to be a bit too recipe-oriented to obtain a proper level of understanding.

Barlow, Statistics
A classic. This is a well-written introduction with some useful mathematical background, simple derivations, and good descriptions. It is written for physics students, so it even has a chapter titled "Errors". I can recommend it if you want to go beyond just having recipes (which you should), in particular as it contains derivations which Crawley, Everitt & Hothorn, and Dalgaard omit. Like most introductory statistics textbooks, it takes a very orthodox or frequentist approach (probability only appears in chapter 7!), which can make the different topics seem like a set of disconnected techniques.

Crawley, Statistics. An Introduction using R
This text emphasises statistics for the biological and to some extent physical (but not social) sciences. It has a reasonable balance between explaining the methods and demonstrating them in R. While there are examples, there is more of an emphasis on principles and the basic maths than there is in Everitt & Hothorn or Dalgaard, for example. Indeed, the maths is very basic and many methods are not properly explained (the course will go beyond this level). However, it is visually appealing and has the advantage of being relatively cheap. Like most statistics books, it presents statistics in the traditional way (look at the Table of Contents).

Dalgaard, Introductory Statistics with R
An introduction to both R and statistics. The mathematical treatment is limited and it takes a somewhat "recipes"-like approach. As the title suggests, R takes a central role. Includes exercises and answers.

Everitt and Hothorn, A Handbook of Statistical Analyses using R
R takes a very central place, with lots of examples, data sets (and perhaps a few too many screen dumps). As the title suggests, this is a guide to using R for statistics rather than a book from which you can learn statistics. Moreover, it covers several topics which are not typical for an introductory statistics course (and which we won't cover). It is as R-centric as Crawley and Dalgaard but a bit more advanced.

Gregory, Bayesian logical data analysis for the physical sciences
A good introduction to both the principles and practical application of Bayesian methods. One of very few books giving a broad introduction and guide for physical scientists (there are lots more such books for social scientists and specific analytic models). He uses Mathematica to illustrate the methods. Highly recommended.

Ivezic et al., Statistics, data mining, and machine learning in astronomy
An excellent, well-written compilation of statistical and machine learning methods, with particular attention to their application in astronomy. Lots of examples and code in python on an accompanying web site.

Jaynes, Probability theory
E.T. Jaynes was one of the main proponents of Bayesian inference. This is a rather unconventional book describing numerous elements of Bayesian probability theory and inference, ranging from the basics through practical examples to fundamental philosophical discussions. It is even polemical in places, and is probably not appropriate for a first exposure to Bayesian inference. But it contains some very thought-provoking discussions.

Lyons, Statistics for nuclear and particle physicists
The title notwithstanding, this is a useful introduction to the use of statistics in the physical sciences in general. It has some good practical advice and side notes, but is deeply orthodox in its approach to inference (the words "posterior" or "Bayes" don't even appear in the index).

MacKay, Information theory, inference and learning algorithms
Not a traditional statistics book, and perhaps not a first book for learning the very basics of Bayesian inference, but a great book for learning about inference both in principle and in practice. He has a good didactic style, and this book contains some very illuminating examples. Also look here for a good introduction to MCMC. MacKay and CUP have done us a great service by making the book available online.

Maindonald and Braun, Data Analysis and Graphics using R
This is essentially a handbook for using R for statistical data analysis rather than a book from which to learn statistics. It is similar in approach and coverage to the classic book of Venables & Ripley (see below), in that it also covers what one would call machine learning methods (e.g. trees, discriminant analysis), but at a slightly lower level. It contains very little mathematics. At 26cm x 18cm x 3.5cm, it won't fit in your pocket.

Sivia, Data Analysis. A Bayesian Tutorial
The first edition was a great introduction to data analysis from the Bayesian perspective. (A new second edition adds three more chapters.) I recommend it if you really want to understand what statistics is and how it relates to probability theory, rather than just learn a bunch of frequentist recipes. That is, don't look in here for p-values and Neyman-Pearson hypothesis testing. It includes numerous examples which are analytically solvable, but covers less on numerical solutions. It goes beyond the scope of the course, and it does not cover R or other packages. If you only ever read one book on statistics, make it this one.

Sachs, Angewandte Statistik. Methodensammlung mit R
A very detailed and mathematical introduction to statistics. It contains a lot more than you'll need for the course, but the level of mathematics is not as high (or as off-putting) as first appearances might suggest. R is used to illustrate the statistics (rather than the other way around, as is the case in some other books). With problems and solutions. I've not used this book, but judging from (a) the Foreword and (b) the lack of virtually any reference to Bayesian statistics or Richard Cox or Harold Jeffreys, this is an unashamedly frequentist approach to statistics. You have been warned!

Toutenburg and Heumann, Deskriptive Statistik and Induktive Statistik
This pair of books - in German - gives a detailed introduction to statistics and R from a somewhat mathematical perspective. They go into more theory and depth than you'll need for this course. Lots of examples and solutions. I've not used them myself.

Venables and Ripley, Modern Applied Statistics with S (MASS)
S is the language on which R is based. This book provides a very good introduction to R and its use for both basic and advanced data analysis. However, it assumes the reader is already reasonably familiar with the techniques, so this is not a book which can be used alone to learn basic statistics. It goes well beyond the course, covering also topics such as GLMs, neural networks and spatial statistics. The accompanying R package "MASS" contains many functions which will be used in the course.

Verzani, Using R for introductory statistics
Quite R-oriented and rather (too) basic. It's essentially an R guide rather than a statistics text.

(Semi-)popular books

Here is a sample of popular or semi-popular books on probability which I have read and which I can recommend to anyone interested in how probability can be used in everyday life. (I don't necessarily agree with everything written in these books, but with much of it.)

Evans, Dylan, Risk Intelligence
A study of how we (should) use simple probability theory in everyday life to help us assess risks and make decisions. Evans' thesis is that many people, regardless of intelligence, have poor risk intelligence, i.e. are not very good at assessing probability, risk, expected gains and losses. This is a very readable and insightful book.

Gigerenzer, Gerd, Reckoning with Risk
A look at how uncertainty and probability are represented and, more often, misrepresented in everyday life: in the media, in law, and especially in medicine. He guides you through interpreting probabilistic information, and how you can use this correctly to make informed decisions. He has some very interesting examples. I can also recommend his other book Risk Savvy.

Kahneman, Daniel, Thinking, fast and slow
A collection of very interesting insights - and results of experiments and surveys - into how we think about probability and statistics. He looks at how people actually assess information and make decisions. One of the main theses is that our intuitive brain is rather poor (in particular, biased) at probabilistic assessments. Very readable, and much of it is convincing.

Websites

On uncertainty, risk, bias, etc.:

On R:

Coryn Bailer-Jones, calj at mpia.de
Last updated 7 July 2015