# Retrievals: Bayesian model fitting¶

Written by Paul Mollière.

We are starting off with some jargon immediately. "Retrieval" is the word exoplanet scientists use for "let's fit a model to our observed exoplanet spectrum." Granted, it is a bit fancier than that, since they usually mean "Bayesian" model fitting. That is, what they are really after are the properties of an atmosphere, given an observation.

Before going into more details on this, let's quickly review the classical $\chi^2$-minimization, that is, what we usually mean when talking about fitting. We will then see how this relates to retrievals further below.

## $\chi^2$-minimization¶

Most (all?) of you will have come across the concept of "fitting an observation" by so-called $\chi^2$-minimization. In its most simple form, one minimizes the quantity \begin{equation} \chi^2 = \sum_{i=1}^{N_\lambda}\frac{(y_i-f_i)^2}{\sigma_i^2}, \end{equation} where $y_i$ are the observed data points, $f_i$ is the model prediction of said data points, $\sigma_i$ is the measurement uncertainty, and $N_{\lambda}$ is the number of data points.
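As a quick illustration, here is $\chi^2$ evaluated for a toy dataset (a synthetic, hypothetical linear "spectrum", purely for demonstration):

```python
import numpy as np

# Toy data: noisy measurements of a hypothetical linear "spectrum"
rng = np.random.default_rng(42)
wavelength = np.linspace(1.0, 2.0, 20)
sigma = np.full_like(wavelength, 0.1)              # measurement uncertainties
y = 0.5 * wavelength + 1.0 + rng.normal(0.0, sigma)

def chi_square(y, f, sigma):
    """Sum of squared, uncertainty-weighted residuals."""
    return np.sum((y - f) ** 2 / sigma ** 2)

# Model prediction with the true parameters
f = 0.5 * wavelength + 1.0
chi2 = chi_square(y, f, sigma)
# For a good model, chi2 should be of order N_lambda (here 20)
print(chi2)
```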

But where does this expression come from?

### Origin of $\chi^2$¶

Let's assume that the measurement uncertainties follow a Gauss distribution and that the measurements are independent. Then the probability of observing the data $\mathbf{y}$, assuming that we have the perfect model $\mathbf{F}(\mathbf{x})$ and the right model parameters $\mathbf{x}$ to fully describe our physical system, would be \begin{equation} P(\mathbf{y}|\mathbf{x}) = Ce^{-\chi^2/2}. \end{equation}

Here $$ C = \prod_{i=1}^{N_\lambda} \left(2\pi\sigma_i^2\right)^{-1/2} $$ is a normalization constant, ensuring that $\int P(\mathbf{y}|\mathbf{x})d\mathbf{y}=1$, as must be the case for any probability distribution. Note that we expressed the data, model, and model parameters as vectors now (so $\mathbf{y}$, $\mathbf{F}$, and $\mathbf{x}$, respectively; so $\mathbf{F}$ has elements $f_i$).

$\chi^2$ is thus related to $P(\mathbf{y}|\mathbf{x})$ as shown above, and by minimizing $\chi^2$ we look for a set of parameter values $\mathbf{x}_{\rm best}$ that maximize $P(\mathbf{y}|\mathbf{x})$.
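We can verify this relation numerically on toy data: the product of the individual Gaussian densities equals $C\,e^{-\chi^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = rng.uniform(0.05, 0.2, size=10)   # toy uncertainties
y = rng.normal(1.0, sigma)                # synthetic data scattered around f = 1
f = np.ones_like(y)                       # model prediction

# Product of the independent Gaussian probability densities...
p_direct = np.prod(np.exp(-0.5 * (y - f) ** 2 / sigma ** 2)
                   / np.sqrt(2.0 * np.pi * sigma ** 2))

# ...equals C * exp(-chi2 / 2)
chi2 = np.sum((y - f) ** 2 / sigma ** 2)
C = np.prod(1.0 / np.sqrt(2.0 * np.pi * sigma ** 2))
p_from_chi2 = C * np.exp(-0.5 * chi2)

assert np.isclose(p_direct, p_from_chi2)
```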

### The case for something fancier than $\chi^2$-minimization¶

It is potentially not a bad assumption that $\mathbf{x}_{\rm best}$ is related to the most likely model parameters, given the measurement. But we would like to know the *actual* distribution of $\mathbf{x}$ that is consistent with the data.

Knowing the actual distribution of $\mathbf{x}$ has some nice advantages: we will see any correlations between model parameters, or true degeneracies, or whether we are mostly insensitive to a parameter. We will also be able to compare different models and see which one is more likely.

## Retrieval definition¶

"Retrievals" do just that: they provide us with an estimate of $P(\mathbf{x}|\mathbf{y})$ (instead of maximizing $P(\mathbf{y}|\mathbf{x})$).

They thus answer the question: "*What is the probability distribution of model parameter values, given an observation?*"

For an atmospheric retrieval we do this by constructing a so-called "forward model" that can generate synthetic (and noise-free) observations as a function of the atmospheric properties, which are expressed through the parameters of the forward model.

The forward model is then compared to the data repeatedly with different parameter values, in order to determine the distribution of parameter values, given the data.

### Bayes' theorem¶

How do we go from $P(\mathbf{y}|\mathbf{x})$, our usual $\chi^2$, to $P(\mathbf{x}|\mathbf{y})$? Here Bayes' theorem is our friend!

Imagine you have two random variables, $x$ and $y$, described by a joint probability distribution $P(x,y)$. For this it holds that \begin{equation} \int P(x,y) dxdy=1, \end{equation} when integrating over all possible values of $x$ and $y$.

If we are just interested in how, say, $x$ is distributed, we can compute its *marginalized* probability distribution \begin{equation} P(x) = \int P(x,y) dy. \end{equation}

Bayes' theorem now simply says that \begin{equation} P(x|y)P(y) = P(y|x)P(x) = P(x,y). \end{equation}

Intuitively this makes sense (I claim): the so-called *conditional* probability $P(y|x)$ of observing a certain value of $y$, given that a certain value of $x$ is observed simultaneously, multiplied with the marginalized probability of $x$, is simply the joint probability of observing these values of $x$ and $y$ at the same time.
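Bayes' theorem is easy to verify for a small discrete example, where all integrals become sums (the probability values below are arbitrary, just for illustration):

```python
import numpy as np

# A small discrete joint distribution P(x, y): rows index x, columns index y
P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])
assert np.isclose(P_xy.sum(), 1.0)

# Marginalized distributions
P_x = P_xy.sum(axis=1)   # P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)   # P(y) = sum_x P(x, y)

# Conditional distributions
P_y_given_x = P_xy / P_x[:, None]   # P(y|x)
P_x_given_y = P_xy / P_y[None, :]   # P(x|y)

# Bayes' theorem: P(x|y) P(y) = P(y|x) P(x) = P(x, y)
assert np.allclose(P_x_given_y * P_y[None, :], P_xy)
assert np.allclose(P_y_given_x * P_x[:, None], P_xy)
```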

### Applying Bayes' theorem in the context of atmospheric retrievals¶

Let's now apply Bayes' theorem to the specific case where we want to fit an exoplanet spectrum with a forward model with the ultimate goal to derive the probability distribution of the model parameters.

Let's begin by defining the following quantities:

- $\mathbf{y}$ is the data, in this case a vector of length $N_\lambda$ that contains the observed flux values $y_i$ at $N_\lambda$ wavelength points (so index $i$ runs from 1 to $N_\lambda$).
- The vector $\mathbf{\sigma}$ (also of length $N_\lambda$) contains the associated uncertainties $\sigma_i$ for the measured flux values $y_i$.
- Next we define the vector $\mathbf{x}$ which has the length $N_{\rm param}$. This vector encodes the properties of the atmosphere for our forward model; $N_{\rm param}$ is therefore the number of parameters. For example, $x_1$ may be the temperature at the top of the atmosphere, $x_2$ may be the temperature at the bottom, $x_3$ may be the abundance of the important absorber H$_2$O, etc.
- Finally, the forward-model $\mathbf{F}(\mathbf{x})$ is a function that takes in the vector $\mathbf{x}$ and predicts a synthetic (noise-free) observation by solving the radiative transfer equation, modeling the observational process, etc. That means that the output of $\mathbf{F}(\mathbf{x})$ is also a vector of length $N_{\rm \lambda}$, like the observation $\mathbf{y}$.

With a retrieval we now want to solve for $P(\mathbf{x}|\mathbf{y})$, so the probability distribution of forward model parameters, given the observation.

This desired distribution is called "posterior distribution" (or just "posterior"). Using Bayes' theorem, we can express the posterior like this $$ P(\mathbf{x}|\mathbf{y}) = \frac{1}{P(\mathbf{y})}\underbrace{P(\mathbf{x})}_{\rm "Prior"} \underbrace{P(\mathbf{y}|\mathbf{x})}_{\rm "Likelihood"}. $$

Here, $P(\mathbf{y})$ can again be regarded as a constant, ensuring that $\int P(\mathbf{x}|\mathbf{y}) d\mathbf{x}=1$.

The two other terms are the *prior* and the *likelihood*, described in the following paragraphs.

### Priors¶

The "prior" $P(\mathbf{x})$ is an important quantity and in fact a key difference between a retrieval and a classical fit via $\chi^2$ minimization.

The prior encodes any *prior* model parameter knowledge that we had about the atmosphere *before* making the observation.

An obvious example for a prior would be $P(T<0) = 0$, that is, temperatures cannot be negative.

Another example is an independent measurement of the planet's mass and radius, giving rise to a prior on the planet's atmospheric gravity.

If we lack any prior information for a given parameter, we will use a broad, "uninformative" prior distribution (usually uniform or log-uniform).
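These examples translate directly into code. Below is a minimal sketch of log-prior functions, with all numerical values hypothetical, illustrating an uninformative uniform prior (which also enforces $P(T<0)=0$) and an informative Gaussian prior (as one might place on, e.g., gravity from an independent mass and radius measurement):

```python
import numpy as np

def log_prior_uniform(x, low, high):
    """Uninformative uniform prior on [low, high]."""
    if low <= x <= high:
        return -np.log(high - low)
    return -np.inf   # zero probability outside the allowed range

def log_prior_gaussian(x, mu, sigma):
    """Informative Gaussian prior, e.g. on log g from an independent
    mass and radius measurement (hypothetical values)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - 0.5 * np.log(2.0 * np.pi * sigma ** 2)

# Temperature cannot be negative: P(T < 0) = 0
assert log_prior_uniform(-100.0, 0.0, 3000.0) == -np.inf
# Inside the allowed range the uniform prior is flat
assert log_prior_uniform(500.0, 0.0, 3000.0) == log_prior_uniform(2500.0, 0.0, 3000.0)
```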

### Likelihood¶

The likelihood brings back our old acquaintance $\chi^2$.

It answers the question: "How likely is it that we observed the data $\mathbf{y}$, assuming that $\mathbf{x}$ correctly describes the state of the atmosphere?"

Assuming that the data points have independent uncertainties that follow a Gauss (normal) distribution we can write $$ P(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{N_\lambda}\frac{1}{\sqrt{2\pi\sigma_i^2}}{\rm exp}\left[-\frac{(y_i-f_i)^2}{2\sigma_i^2}\right]. $$

Here $f_i$ is the $i$-th element of $\mathbf{F}(\mathbf{x})$.

It is important to note that other forms of the likelihood can also be used, for example if the data points are not independent but correlated. In this case the likelihood becomes $$ P(\mathbf{y}|\mathbf{x}) = (2\pi)^{-N_\lambda/2}{\rm det}(\Sigma)^{-1/2} {\rm exp}\left\{ -\frac{1}{2}\left[\mathbf{y}-\mathbf{F}(\mathbf{x})\right]^{\rm T}\Sigma^{-1}\left[\mathbf{y}-\mathbf{F}(\mathbf{x})\right]\right\}, $$ where $\Sigma$ is the covariance matrix of the data. This is simply the general, multivariate Gauss distribution. The simpler case above is what you get when the data points are uncorrelated, so that $\Sigma$ is a diagonal matrix.
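As a sanity check, here is a sketch of both likelihood forms on toy data, confirming numerically that the multivariate Gaussian reduces to the product of independent Gaussians when $\Sigma$ is diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
sigma = np.full(n, 0.1)
y = rng.normal(1.0, 0.1, size=n)   # toy data
f = np.ones(n)                     # toy model prediction

def log_likelihood_diag(y, f, sigma):
    """Independent (uncorrelated) Gaussian errors."""
    return np.sum(-0.5 * (y - f) ** 2 / sigma ** 2
                  - 0.5 * np.log(2.0 * np.pi * sigma ** 2))

def log_likelihood_cov(y, f, Sigma):
    """General multivariate Gaussian with covariance matrix Sigma."""
    r = y - f
    n = len(y)
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet
                   + r @ np.linalg.solve(Sigma, r))

# With a diagonal covariance matrix the two expressions agree
Sigma = np.diag(sigma ** 2)
assert np.isclose(log_likelihood_diag(y, f, sigma),
                  log_likelihood_cov(y, f, Sigma))
```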

There can also be situations where the data do not follow a Gaussian distribution at all, in which case the likelihood may have an altogether different functional form.

### A numerical necessity: the log-likelihood¶

Since during the retrieval it can happen that we compare very small to very large likelihood values (which is challenging numerically) it is customary to work in log-space.

In this case the likelihood (usually denoted by $L$) becomes the so-called "log-likelihood" ${\rm log}(L)$: $$ {\rm log}(L) = -\frac{1}{2}\sum_{i=1}^{N_\lambda}\frac{(y_i-f_i)^2}{\sigma_i^2} + C, $$

Here $C=-0.5\sum_{i=1}^{N_{\lambda}}{\rm log}(2\pi \sigma_i^2)$ is a constant in case that the uncertainties are fixed. This is usually the case, but note that sometimes the magnitude (and covariance in general) of the uncertainties is also a free parameter of the retrieval.

The log-likelihood therefore simply is $$ {\rm log}(L) = -\frac{\chi^2}{2}+C. $$
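A quick numerical illustration of why log-space is necessary (toy data, not a real spectrum): for many data points the likelihood itself underflows to zero in double precision, while the log-likelihood remains perfectly well behaved.

```python
import numpy as np

# Many data points: the likelihood itself is far below the smallest
# representable double, but log(L) is a perfectly ordinary number.
n = 10_000
rng = np.random.default_rng(2)
sigma = np.full(n, 1.0)
y = rng.normal(0.0, sigma)   # toy data
f = np.zeros(n)              # toy model prediction

chi2 = np.sum((y - f) ** 2 / sigma ** 2)
C = -0.5 * np.sum(np.log(2.0 * np.pi * sigma ** 2))
log_L = -0.5 * chi2 + C

L = np.exp(log_L)   # underflows to exactly 0.0 in double precision
print(log_L, L)
```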

## Retrieval practicalities¶

- The goal of a retrieval is to determine the posterior $P(\mathbf{x}|\mathbf{y})$. The problem is that we usually cannot simply analytically (or even numerically) calculate this function.
- Instead, retrieval methods make use of the fact that we can straightforwardly calculate prior and likelihood values once a forward model has been defined and a set of parameter values has been selected.
- Commonly used retrieval setups then employ the so-called "nested sampling" method (and sometimes MCMC).
- All these methods do is generate samples from $P(\mathbf{x}|\mathbf{y})$. So while we do not know the exact functional form of $P(\mathbf{x}|\mathbf{y})$, we know how to sample from it.
- For this sampling the forward model has to be called many (thousands to hundreds of millions of) times and compared to the data.
- Sample suggestions which are more probable, given the prior, and more consistent with the data, given the likelihood, are more likely to be accepted.
- The retrieval process is also called "inverting the forward model", and can be very computationally expensive.
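To make these points concrete, here is a toy retrieval with a deliberately trivial forward model (a flat spectrum whose level is the single free parameter), sampled with a simple Metropolis-Hastings MCMC. This is only a sketch: real retrievals use radiative-transfer forward models and more sophisticated samplers such as nested sampling.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "forward model": a flat spectrum whose level is the single parameter
def forward_model(x, n=20):
    return np.full(n, x)

# Synthetic observation with known truth x_true = 2.0
sigma = 0.3
y = forward_model(2.0) + rng.normal(0.0, sigma, size=20)

def log_prior(x):
    return 0.0 if 0.0 < x < 10.0 else -np.inf   # uniform prior

def log_likelihood(x):
    return -0.5 * np.sum((y - forward_model(x)) ** 2 / sigma ** 2)

# Metropolis-Hastings: propose a step, accept with probability
# min(1, posterior_new / posterior_old)
x, samples = 5.0, []
log_p = log_prior(x) + log_likelihood(x)
for _ in range(20_000):
    x_new = x + rng.normal(0.0, 0.2)
    log_p_new = log_prior(x_new) + log_likelihood(x_new)
    if np.log(rng.uniform()) < log_p_new - log_p:
        x, log_p = x_new, log_p_new
    samples.append(x)
samples = np.array(samples[5_000:])   # discard burn-in

# Posterior mean should be near the truth, width ~ sigma / sqrt(N_lambda)
print(samples.mean(), samples.std())
```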

### Bayesian vs. frequentist statistics¶

- You may have heard the term "Bayesian vs. frequentist statistics" before. Usually it is said that $\chi^2$-minimization is frequentist, while retrievals are Bayesian.
- What this really refers to is that retrievals calculate a posterior distribution for the model parameters using Bayes theorem. Frequentists say that this is not allowed because models (and their parameters) are not a "real thing" (in contrast to a measured data point), so it is not even allowed to think about their probability distribution.
- Rumor has it that frequentist and Bayesian statisticians have gotten into fist fights over this.
- I am not knowledgeable enough to understand the subtleties of this fight and will just carry on below.

## Model selection¶

- With retrieval techniques such as nested sampling it also becomes possible to answer this question: "*Given the data, which of my tested forward models is the most likely one?*"
- Here we do not mean which parameters of a given model are the most likely ones, but which model is the most likely one, among a set of tested models.
- Examples for different models would be one that uses an isothermal temperature profile for the atmosphere and one that uses a profile where the temperature varies as a function of altitude. The latter model has more free parameters, of course, namely those that describe the altitude dependence of the temperature.
- Another real-life example is where one model includes more molecules than another one. This is useful to test whether a certain molecule is needed to explain the data, and allows us to derive a detection significance.
- Note, however, that model selection does not allow us to find the correct or best model in absolute terms. All it allows is to select the most likely model from a set of models.
- It could be that there's an even more probable model that has not been explored yet in the model selection process. Alternatively put: there's also the danger to select the "best" model from a set of bad or unphysical models.

### Bayes factor¶

- Model selection is done via the so-called Bayes factor, which is defined as follows $$B_{12} = \frac{P(\mathbf{y}|\mathcal{M}_1)}{P(\mathbf{y}|\mathcal{M}_2)}.$$
- Here $P(\mathbf{y}|\mathcal{M}_i)$ is the probability to observe the data $\mathbf{y}$, given that the physical processes giving rise to the data are described by model $\mathcal{M}_i$.
- Using Bayes' theorem, we can write $$ B_{12} = \frac{P(\mathcal{M}_1|\mathbf{y})}{P(\mathcal{M}_2|\mathbf{y})}\frac{P(\mathcal{M}_2)}{P(\mathcal{M}_1)}. $$
- If we assume that we do not have any prior information on which model is preferable ($P(\mathcal{M}_i)=P(\mathcal{M}_j) \ \forall \ i,j$), we get $$ B_{12} = \frac{P(\mathcal{M}_1|\mathbf{y})}{P(\mathcal{M}_2|\mathbf{y})}. $$
- This is what we want: the ratio of model probabilities, given the data.
- Now, the nested sampling method actually calculates $P(\mathbf{y}|\mathcal{M})$ during the retrieval; this quantity is also called the *evidence* $Z$ (and sometimes "marginal likelihood").
- The evidence $Z$ is calculated in nested sampling by numerically integrating \begin{align} Z & = P(\mathbf{y}|\mathcal{M}) \\ & = \int P(\mathbf{y}|\mathbf{x})P(\mathbf{x}|\mathcal{M})d\mathbf{x} \\ & = \int \underbrace{P(\mathbf{y}|\mathbf{x})}_{\rm likelihood}\underbrace{P(\mathbf{x})}_{\rm prior}d\mathbf{x}. \end{align}
- In the last line we dropped the (still implied) $\mathcal{M}$-dependence of the prior of $\mathbf{x}$.
- So here we encounter two old acquaintances again: the prior and the likelihood. Nested sampling thus does nothing other than numerically integrating the evidence $Z$ using a Monte Carlo scheme: it samples the prior volume to get model parameters, then calculates the likelihood, and adds it to a large sum ($Z$), multiplied by an estimate for the prior volume of this sample.
- Importance-sampling the drawn parameter values with a weight that equals $w=P(\mathbf{y}|\mathbf{x})P(\mathbf{x})\Delta\mathbf{x}$ gives rise to a valid posterior sample.
- A nice reference for learning more about nested sampling is this overview paper.
- In summary it thus holds that \begin{equation} B_{12} = \frac{Z_1}{Z_2}. \end{equation}
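For a model simple enough to integrate by brute force, we can compute evidences and the Bayes factor directly. The sketch below uses a toy Gaussian-mean problem (not an atmospheric model): model 1 has the mean as a free parameter, model 2 fixes it to zero, and the data are generated with true mean 1, so model 1 should win decisively.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.5
y = rng.normal(1.0, sigma, size=20)   # toy data, true mean = 1.0

def log_L(mu):
    """Gaussian log-likelihood of the data for mean parameter mu."""
    return np.sum(-0.5 * (y - mu) ** 2 / sigma ** 2
                  - 0.5 * np.log(2.0 * np.pi * sigma ** 2))

# Model 1: mean mu is a free parameter with a uniform prior on [-5, 5];
# Z1 = int L(mu) P(mu) dmu, evaluated on a grid
mu_grid = np.linspace(-5.0, 5.0, 2001)
dx = mu_grid[1] - mu_grid[0]
prior = 1.0 / 10.0                    # uniform prior density
Z1 = np.sum(np.exp([log_L(m) for m in mu_grid])) * dx * prior

# Model 2: mu fixed to 0 (no free parameters), so Z2 is just the likelihood
Z2 = np.exp(log_L(0.0))

B12 = Z1 / Z2
print(B12)   # much greater than 1: the data prefer the model with a free mean
```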

## Bayes factor analogy to Occam's razor¶

- The principle of Occam's razor basically states that "the simplest explanation is usually the best one".
- Model selection via Bayes factors implements this principle on the grounds of probability theory.
- Let's build some intuition based on a simple approximation of the evidence.
- For the evidence it holds that $$ Z = \int P(\mathbf{x})P(\mathbf{y}|\mathbf{x})d\mathbf{x}. $$
- We can approximate this integral by multiplying the peak value of $P(\mathbf{y}|\mathbf{x})$ with the width of this peak ($\Delta\mathbf{x}_{\rm peak}$): $$ Z \approx P(\mathbf{x}_{\rm peak})P(\mathbf{y}|\mathbf{x}_{\rm peak})\Delta\mathbf{x}_{\rm peak} . $$
- Now, let's assume that the priors on all parameters were uniform, then we have that $$ P(\mathbf{x}) = \prod_{i=1}^{N_{\rm param}}{\delta} x_i^{-1}, $$ where $N_{\rm param}$ is the number of free parameters, and ${\delta} x_i$ is the prior width of parameter $i$.
- Likewise, assume that $$\Delta\mathbf{x}_{\rm peak} = \prod_{i=1}^{N_{\rm param}}{\Delta} x_i,$$ where ${\Delta} x_i$ is the posterior width of parameter $i$. This means that we assume that all parameters are independent.
- It thus holds that $$ Z \approx P(\mathbf{y}|\mathbf{x}_{\rm peak})\prod_{i=1}^{N_{\rm param}}\frac{{\Delta} x_i}{{\delta} x_i}. $$
- Let's now assume that we compare two models $\mathcal{M}_1$ and $\mathcal{M}_2$, which only differ in that $\mathcal{M}_2$ has one additional, completely useless, parameter $x_{\rm add}$ (i.e., a parameter that has no effect whatsoever on the forward model).
- It then holds that the best-fit models of $\mathcal{M}_1$ and $\mathcal{M}_2$ are identical, such that $$ Z_2 = Z_1 \frac{\Delta x_{\rm add}}{\delta x_{\rm add}}. $$
- Because the parameter has no influence on the model, it will be unconstrained, and the inferred posterior width will be equal to the prior width, $\Delta x_{\rm add}=\delta x_{\rm add}$.
- In this case we thus have $$ B_{12} = 1, $$ since $Z_1=Z_2$.
- We thus see that models $\mathcal{M}_1$ and $\mathcal{M}_2$ are equally likely, so there's no need to consider the more complicated model $\mathcal{M}_2$, which has one additional parameter.
- In general an additional parameter always makes the model more flexible and may thus lead to a somewhat better fit, so $$ P(\mathbf{y}|\mathbf{x}'_{\rm peak},\mathcal{M}_2) > P(\mathbf{y}|\mathbf{x}_{\rm peak},\mathcal{M}_1), $$ where ${x}'_{\rm peak}$ is the maximum-likelihood parameter value of $\mathcal{M}_2$, which is different from ${x}_{\rm peak}$, the maximum-likelihood parameter value of $\mathcal{M}_1$.
- Whether or not $\mathcal{M}_2$ is preferred over $\mathcal{M}_1$ then depends on whether the improvement in the maximum likelihood outweighs the penalty factor $\Delta x_{\rm add}/\delta x_{\rm add}\leq 1$. In general, adding free parameters decreases the evidence if the fit (maximum likelihood) does not improve significantly, because additional factors $\Delta x/\delta x\leq 1$ are multiplied with the likelihood.
- We also see another property of the Bayes factor: in the simple approximation above we get that $$ \frac{\partial Z}{\partial {\delta} x_{\rm i}} < 0. $$
- This means that wider priors will lead to lower evidences. The Bayes factor thus penalizes ignorance: if we do not know which prior range to take for a model, and just make it very (too) wide, the Bayes factor will be lower, and the model will be less favorable.
- Another intricacy that I want to mention here is that more complex models often have more correlation between the model parameters (which we completely neglected above). This results in 1-d marginalized posterior distributions which are wider (which could increase the evidence in our simple model). However, in addition to more $\Delta x/\delta x\leq 1$ terms being multiplied, the posterior volume of a more strongly correlated distribution is actually smaller than the one of a non-correlated distribution. This effect can decrease the evidence.
- Bayes factors of 3, 10, and 150 are sometimes interpreted as "weak", "moderate", and "strong" preference for the winning model. But also note that $B=150$ can be converted to a 3.6$\sigma$ "detection" of the winning model, while a $5\sigma$ "detection" requires $B=43000$ (Benneke & Seager 2013).
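The useless-parameter argument above can be checked numerically. In this toy sketch the evidence is estimated by brute-force Monte Carlo (the mean likelihood over prior draws; real retrievals use nested sampling instead), and adding a parameter that the likelihood ignores leaves the evidence, and hence the Bayes factor, unchanged up to Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 0.5
y = rng.normal(1.0, sigma, size=10)   # toy data

def likelihood(mu):
    """Vectorized Gaussian likelihood for an array of mu values."""
    r = y[None, :] - np.asarray(mu)[:, None]
    return np.prod(np.exp(-0.5 * r ** 2 / sigma ** 2)
                   / np.sqrt(2.0 * np.pi * sigma ** 2), axis=1)

# Evidence by brute-force Monte Carlo: Z = mean likelihood over prior draws
n = 200_000
Z1 = likelihood(rng.uniform(-5.0, 5.0, size=n)).mean()

# Model 2 adds a useless parameter x_add ~ U(0, 1) that the likelihood
# ignores entirely; averaging over it leaves the evidence unchanged
mu2 = rng.uniform(-5.0, 5.0, size=n)
x_add = rng.uniform(0.0, 1.0, size=n)   # drawn, but never used by likelihood
Z2 = likelihood(mu2).mean()

B12 = Z1 / Z2
print(B12)   # close to 1: the extra parameter neither helps nor hurts
```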