Why science is not really about falsification

A popular impression of science is that it works by falsification. The idea is that a scientist specifies a hypothesis which makes a clear prediction. If an observation or measurement is made that contradicts this prediction, the hypothesis is falsified. A much-used analogy is that of the black swan. The hypothesis is that all swans are white; the observation of one black swan falsifies this hypothesis. Null hypothesis testing attempts to put rejection of the hypothesis on a numerical basis by using the p value to specify a rejection confidence.
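To make the p value concrete, here is a minimal sketch of a null hypothesis test (the coin-flip scenario and the 5% threshold are illustrative assumptions, not from the text): under the null hypothesis that a coin is fair, the p value is the total probability of outcomes at least as extreme as the one observed.

```python
import math

def binom_pmf(k, n, p=0.5):
    # probability of exactly k heads in n flips of a coin with P(heads) = p
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def p_value(k, n):
    # two-sided p value under the null hypothesis "the coin is fair":
    # total probability of every outcome at least as unlikely as k heads
    observed = binom_pmf(k, n)
    return sum(binom_pmf(i, n) for i in range(n + 1)
               if binom_pmf(i, n) <= observed)

print(p_value(60, 100))   # ~0.057: not rejected at the conventional 5% level
print(p_value(70, 100))   # far below 0.05: the null would be "rejected"
```

Note that even here the test yields a graded probability, not a binary verdict; the 5% cut is a convention imposed on top of it.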

It's a nice story, but science does not work like this in practice. First, we rarely observe directly what the hypothesis states. In reality we observe something (e.g. the motions of planets) in order to infer the truth (or perhaps just the utility) of the hypothesis (e.g. "the Sun and planets orbit the Earth"). Second, a hypothesis rarely makes an explicit binary prediction. We must instead set up a model, which generally predicts the size of an effect, not simply its presence or absence. Third, measurements are noisy (or are a sample from a population), so we cannot say with certainty whether a prediction is correct; we only get a measure of proximity between the data and the model's predictions. The black swan is a poor caricature of science: if we stick to it, we must ask what we mean by "black" and indeed what we mean by "swan". What do we conclude if we observe a grey swan, or a bird that looks much like a swan but has small genetic differences? Testing models is not black and white (excuse the pun).

Fourth, a model is only ever an approximation to reality. All models are therefore false at some level, so what does it even mean to falsify a model? If we required models to be true, we would end up rejecting them all. Yet science is full of approximate models that are very useful in practice: the model of gases as non-interacting particles of zero volume; classical electromagnetism; Newton's law of gravity. All of these are false models of reality, yet they are very accurate in certain domains.
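The third point, that noisy measurements give only a degree of proximity between data and model, can be sketched as follows (a hypothetical example assuming Gaussian measurement noise; the predicted value, noise level, and measurements are invented for illustration):

```python
import math

def likelihood(y, mu, sigma):
    # probability density of measuring y when the model predicts mu,
    # assuming Gaussian measurement noise with standard deviation sigma
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# the model predicts 10.0; the likelihood falls off smoothly with the
# measurement's distance from that prediction but never reaches exactly
# zero, so no single noisy measurement can falsify the model outright
for y in (10.1, 11.0, 15.0):
    print(y, likelihood(y, mu=10.0, sigma=0.5))
```

A surprising measurement makes the model less plausible by degrees; it never delivers the clean refutation the black-swan picture promises.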

In the real world we cannot falsify a hypothesis or model any more than we "truthify" it. We can only determine a degree (probability) of model accuracy, of which "true" and "false" are the two extremes. Our model predictions will never agree with measurements exactly. We must therefore ask to what extent measured data support a specified model. As we have seen in earlier chapters, this is not an absolute measure: even under the true model, the observed data can be very unlikely (e.g. section 5.2). Thus not only are we unable to reject a model in an absolute sense, we cannot even do it probabilistically in any meaningful sense. All we can do is ask which of the available models explains the data best. That is, we must compare models. How to do this is the subject of the next chapter.
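The closing point, that all we can do is ask which available model explains the data best, amounts to model comparison. A minimal sketch, using invented data and two fixed-parameter Gaussian models (a fuller treatment would also account for parameter uncertainty and model complexity):

```python
import math

def log_likelihood(data, mu, sigma):
    # log probability of the data under a Gaussian model with mean mu
    # and noise standard deviation sigma
    return sum(-0.5 * ((y - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi)) for y in data)

data = [4.8, 5.1, 5.3, 4.9, 5.2]                # hypothetical measurements
ll_a = log_likelihood(data, mu=5.0, sigma=0.2)  # model A predicts 5.0
ll_b = log_likelihood(data, mu=6.0, sigma=0.2)  # model B predicts 6.0

# neither model is declared "true" or "false"; we only ask which one
# explains these data better, here by comparing log-likelihoods
print("model A favoured" if ll_a > ll_b else "model B favoured")
```

The comparison is relative: model A is favoured over model B, but a third model might beat both, and none is thereby certified as true.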

Coryn Bailer-Jones.
Extract from my book Practical Bayesian Inference (pp. 223–224).