Faculty of Physics and Astronomy, University of Heidelberg

**Lecturer**: PD Dr. Coryn Bailer-Jones

**Assistants**: Dr. Morgan Fouesneau (exercise class and marking), Dr. Dae-Won Kim (exercise marking)

**Time and location**:

Starting on Tuesday 14 April 2015

Lecture: 09:15 to 11:00 Tuesdays, Im Neuenheimer Feld 227 ("KIP"), room HS 2

Exercises: 16:15 to 18:00 Wednesdays, Philosophenweg 12, room CIP


This course will provide an introduction to using statistics and computational methods to analyse data. It comprises one 2-hour lecture per week, plus one 2-hour exercise session per week during which you will put into practice what you have learned in the lectures (on paper and on the computer). There will also be homework assignments. The course counts for 6 credit points, which corresponds to 180 hours of work, of which no more than 50 will be contact time (lectures and exercise sessions). There will be an exam, which you need to pass to get the credit points. The course will be held in English.

This course will take a pragmatic approach. The focus will be on concepts, understanding problems, and the application of techniques to solving problems, rather than reproducing proofs or teaching you recipes to memorize.

This course is part of the physics Master's programme, but may also be taken within the Bachelor's programme. You do not need to have attended any specific university-level courses in statistics, but I will assume that you have completed the first few semesters of the physics Bachelor's programme or another quantitative science course. Look at what is stated in the Master's or Bachelor's module handbooks for further information. The course will involve programming (you can use any language you like). If you don't yet have any programming experience, we strongly recommend you learn a language, and suggest R or Python.

This lecture is open to anyone who is interested.

Summary of course formalities. Registration for the exercises (necessary if you want credit points) is now closed.

Note in particular:

- to get the credit points for the course you need to pass the final exam
- to be able to sit the exam you need to get 60% of the total marks summed over all the homework sheets (you do not need 60% on each individual sheet)
- each homework sheet has a maximum of 100 points
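
The homework rule above can be illustrated with a minimal sketch in Python. The function name `may_sit_exam`, the example scores, and the reading of "60%" as "at least 60%" are assumptions made for illustration; they are not part of the official course formalities.

```python
def may_sit_exam(points_per_sheet, max_per_sheet=100, threshold_percent=60):
    """Return True if the summed homework points reach the threshold.

    Assumption: the rule is read as "at least 60% of the total available
    marks, summed over all sheets". Individual sheets may fall below 60%.
    """
    if not points_per_sheet:
        raise ValueError("need at least one homework sheet")
    total = sum(points_per_sheet)
    maximum = max_per_sheet * len(points_per_sheet)
    # Integer comparison avoids floating-point rounding at the boundary.
    return 100 * total >= threshold_percent * maximum


# Hypothetical scores: one weak sheet is compensated by the others.
print(may_sit_exam([40, 90, 70]))  # 200 of 300 points
# Hypothetical scores: the total falls short of the threshold.
print(may_sit_exam([50, 60, 65]))  # 175 of 300 points
```

With these made-up scores the first student qualifies (about 67% overall, despite one sheet at 40%), while the second does not (about 58% overall).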

There will be 12 lectures on the following dates (the exercise session is on the following day). The topics allocated to the dates may well change.

Cumulative lecture notes will be made available. They will be updated with the latest lecture by the end of the day on which the lecture is held. Corrections to earlier chapters will also be made as necessary.

Link to the most recent version of the lecture notes (PDF)

Lecture date | Topic |
---|---|
14 April | Introduction and probability basics |
21 April | Estimation and errors: Describing data |
28 April | Statistical models and inference |
5 May | Linear models and regression |
12 May | Parameter estimation I |
19 May | Parameter estimation II |
26 May | Monte Carlo methods for inference |
2 June | No lecture |
9 June | Frequentist hypothesis testing |
16 June | Model comparison |
23 June | Cross validation, regularization, and basis functions |
30 June | Kernels and other things |
7 July | Classification |
14 July | Study week (no lecture) |
21 July | Exam week (see below) |

There is one exercise class for all students. It will mostly deal with the topics covered in the lecture on the previous day. This is an integral part of the lecture course, as you only really learn by doing. For each exercise class there is also a set of homework questions. For full information see the exercise/homework class web site.

If you decide to use R for the computer-based exercises, you will find R code for most of the examples presented in the lectures embedded in the lecture notes. The longer R scripts are available as separate files in this zip file, which also contains some of the data sets used in the lectures. The zip file will be updated weekly after the lecture.

Registration for the exercise classes is now closed. You are welcome to attend the classes, but if you haven't registered, you won't be able to get the credit points.

There is a discussion forum (moodle) associated with this course. For each lecture/exercise week there is a separate forum. We recommend that everyone sign up for this, as it will be used to clarify things about the exercises in particular. Contact me if you still need the registration password.

I recommend that you get hold of an introductory statistics text to use during this course. There are many around, varying in their scope, level, emphasis and quality. The course does not follow a single book; below is a random sample. The course focuses on the use of statistics in the physical sciences, so some recipes common in the social sciences, even basic ones, will not be covered: you may want to take this into account when getting a book. There are several texts which examine specifically the use of R in statistics, although these tend to be a bit too recipe-oriented to give a proper level of understanding.

Barlow, **Statistics**

A classic. This is a well-written introduction with some useful mathematical
background, simple derivations and good descriptions. It is written
for physics students, so it even has a chapter titled "Errors". I can
recommend it if you want to go beyond just having recipes (which you
should), in particular as it contains derivations which Crawley,
Everitt & Hothorn and Dalgaard omit. Like most introductory
statistics textbooks, it takes a very orthodox or frequentist
approach (probability only appears in chapter 7!), which can make the
different topics seem like a set of disconnected techniques.

Crawley, **Statistics. An Introduction using R**

This text emphasises statistics for biological and to some extent
physical (but not social) sciences. It has a reasonable balance
between explaining the methods and demonstrating them in R. While
there are examples, there is more of an emphasis on principles and the
basic maths than there is in Everitt & Hothorn or Dalgaard, for
example. Even so, the maths is very basic and many methods are not
properly explained (the course will go beyond this level). However, it
is visually appealing and has the advantage of being relatively cheap.
Like most statistics books, it presents statistics in the traditional
way (look at the Table of Contents).

Dalgaard, **Introductory Statistics with R**

An introduction to both R and statistics. The mathematical treatment is limited
and it takes a somewhat recipe-like approach. As the title suggests, R takes a
central role. Includes exercises and answers.

Everitt and Hothorn, **A Handbook of Statistical Analyses using R**

R takes a very central place, with lots of examples, data sets
(and perhaps a few too many screen dumps). As the title suggests, this
is a guide to using R for statistics rather than a book from which you
can learn statistics. Moreover, it covers several topics which are not
typical for an introductory statistics course (and which we won't
cover). It is as R-centric as Crawley and Dalgaard but a bit more
advanced.

Gregory, **Bayesian logical data analysis for the physical sciences**

A good introduction to both the principles and practical application of Bayesian methods. One of very few books giving a broad introduction and guide for physical scientists (there are lots more such books for social scientists and specific analytic models). He uses Mathematica to illustrate the methods. Highly recommended.

Ivezic et al., **Statistics, data mining, and machine learning in astronomy**

An excellent, well-written compilation of statistical and machine learning methods in general, with particular attention to their application in astronomy. Lots of examples and code in Python on an accompanying web site.

Jaynes, **Probability theory**

E.T. Jaynes was one of the main proponents of Bayesian inference. This is a rather unconventional book describing numerous elements of Bayesian probability theory and inference, ranging from the basics through practical examples to fundamental philosophical discussions. It is even polemical in places, and is probably not appropriate for a first exposure to Bayesian inference. But it contains some very thought-provoking discussions.

Lyons, **Statistics for nuclear and particle physicists**

The title notwithstanding, this is a useful introduction to the use of
statistics in the physical sciences in general. It has some good practical advice and side notes, but is deeply orthodox in its approach to inference (the words "posterior" and "Bayes" don't even appear in the index).

MacKay, **Information theory, inference and learning algorithms**

Not a traditional statistics book, and perhaps not a first book for learning the very basics of Bayesian inference, but a great book for learning about inference both in principle and in practice. He has a good didactic style, and this book contains some very illuminating examples. Also look here for a good introduction to MCMC. MacKay and CUP have done us a great service by making the book available online.

Maindonald and Braun, **Data Analysis and Graphics using R**

This is essentially a handbook for using R for statistical data
analysis rather than a book from which to learn statistics. It is
similar in approach and coverage to the classic book of Venables &
Ripley (see below), in that it also covers what one would call machine
learning methods (e.g. trees, discriminant analysis), but at a
slightly lower level. It contains very little mathematics. At 26cm x
18cm x 3.5cm, it won't fit in your pocket.

Sivia, **Data Analysis. A Bayesian Tutorial**

The first edition
was a great introduction to data analysis from the Bayesian
perspective. (A new second edition adds three more chapters.) I
recommend it if you really want to understand what statistics is and how
it relates to probability theory, rather than just learn a bunch of
frequentist recipes. That is, don't look in here for p-values and
Neyman-Pearson hypothesis testing. It includes numerous examples
which are analytically solvable, but covers less on numerical
solutions. It goes beyond the scope of the course, and it does not
cover R or other packages. If you only ever read one book on statistics, make it this one.

Sachs, **Angewandte Statistik. Methodensammlung mit R**

A very
detailed and mathematical introduction to statistics. It contains a
lot more than you'll need for the course but the level of mathematics
is not as high (or as off-putting) as first appearances might
suggest. R is used to illustrate the statistics (rather than the other
way around, as is the case in some other books). With problems and
solutions. I've not used this book, but judging from (a) the Foreword
and (b) the virtual absence of any reference to Bayesian statistics,
Richard Cox or Harold Jeffreys, this is an unashamedly frequentist
approach to statistics. You have been warned!

Toutenburg and Heumann, **Deskriptive Statistik** and **Induktive
Statistik**

This pair of books - in German - gives a detailed
introduction to statistics and R from a somewhat mathematical
perspective. It goes into more theory and depth than you'll need for
this course. Lots of examples and solutions. I've not used it myself.

Venables and Ripley, **Modern Applied Statistics with S** (MASS)

R is a dialect of the S language. This book provides a very
good introduction to R and its use for both basic and advanced data
analysis. However, it assumes the reader is already reasonably familiar
with the techniques, so this is not a book which can be used alone to
learn basic statistics. It goes well beyond the course, covering also
topics such as GLIMs, neural networks and spatial statistics. The
accompanying R package "MASS" contains many functions which will be
used in the course.

Verzani, **Using R for introductory statistics**

Quite
R-oriented and rather (too) basic. It's essentially an R guide rather
than a statistics text.

Here is a sample of popular or semi-popular books on probability which I have read and which I can recommend to anyone interested in how probability can be used in everyday life. (I don't necessarily agree with everything written in these books, but with much of it.)

Evans, Dylan, **Risk Intelligence**

A study of how we (should) use simple probability theory in everyday life to help us assess risks and make decisions. Evans' thesis is that many people, regardless of intelligence, have poor *risk intelligence*, i.e. are not very good at assessing probability, risk, expected gains and losses. This is a very readable and insightful book.

Gigerenzer, Gerd, **Reckoning with Risk**

A look at how uncertainty and probability are represented and, more often, misrepresented in everyday life: in the media, in law, and especially in medicine. He guides you through interpreting probabilistic information, and shows how you can use it correctly to make informed decisions. He has some very interesting examples. I can also recommend his other book *Risk Savvy*.

Kahneman, Daniel, **Thinking, fast and slow**

A collection of very interesting insights - and results of experiments and surveys - into how we think about probability and statistics. He looks at how people actually assess information and make decisions. One of the main theses is that our intuitive brain is rather poor (in particular, biased) at probabilistic assessments. Very readable, and much of it is convincing.

On uncertainty, risk, bias, etc.:

- Harding Center for Risk Literacy
- Understanding Uncertainty
- Gapminder (in particular, the ignorance project)

On R:

- R software package
- An introduction to R
- The R Wiki
- R programming Wikibook
- R blog
- R things from the astrostatistics center at Penn State
- A course on R and statistics at the Statlab, University of Heidelberg

Coryn Bailer-Jones, calj at mpia.de

Last updated 7 July 2015