Let us turn to *statistical inference*, the process of drawing
conclusions from data. This is often done in the context of fitting a
model to data.

Given a set of observations/data, one often wants to summarize and get at underlying physics by fitting some sort of model to the data. The model might be an empirical model or it might be motivated by some underlying theory. In many cases, the model is parametric: there is some number of parameters that specify a particular model out of a given class.

The general scheme for doing this is to define some merit function that is used to determine the quality of a particular model fit, and choose the parameters that provide the best match, e.g. a minimum deviation between model and data, or a maximum probability that a given set of parameters matches the data. When doing this one also wants to understand something about how reliable the derived parameters are, and also about how good the model fit actually is, i.e. to what extent it is consistent with your understanding of uncertainties on the data.

There are two different ``schools" about how to think about model fitting:
frequentist and Bayesian. In the frequentist approach (sometimes called
the classical approach), one considers how frequently one might expect
to observe a given data set given some underlying model (i.e., *P*(*D*| *M*), the
probability of observing a data set given a model); if very infrequently,
then the model is rejected. The model that ``matches" the observed data
most frequently (based on some calculated statistic) is viewed as the
correct underlying model, and as such, gives the best parameters, along
with some estimate of their uncertainty. Here, uncertainty refers to the
range of different observed data sets that are consistent with a given
model.

The Bayesian approach considers the probability that a given model
is correct given a set of data (i.e. *P*(*M*| *D*), the probability of a
model given a dataset). It allows for the possibility that there is
external information that may prefer one model over another, and this
is incorporated into the analysis as a *prior*. Because it considers the
probability of different models, it deals in probability distribution
functions of parameters. Examples of priors: fitting a Hess diagram
with a combination of SSPs, with external constraints on allowed ages;
fitting UTR data for low count rates in the presence of readout noise.

The frequentist paradigm was dominant in astronomy until fairly recently, but the Bayesian paradigm has become increasingly widespread. In many cases the two give the same result, though with somewhat different interpretations; in some cases, the results can differ.

Bayesian vs. Frequentist Statistics

Suppose we have measured the mean mass of a sample of G stars, by some method, and say: at the 68% confidence level, the mean mass of G stars is *a* ± *b*. What does this statement mean?

Bayesian answer: There is some true mean mass, *M*, of G stars, and there is
a 68% probability that
*a* − *b* < *M* < *a* + *b*. More pedantically: the
hypothesis that the true mean mass, *M*, of G stars lies in the range
*a* − *b* to *a* + *b* has a 68% probability of being true. The probability of the
hypothesis is a real-numbered expression of the degree of belief we should
have in the hypothesis, and it obeys the axioms of probability theory.

In “classical” or “frequentist” statistics, a probability is a
statement about the frequency of outcomes in many repeated trials. With
this restricted definition, one can’t refer to the probability of
a hypothesis — it is either true or false. One can refer to the
probability of data if a hypothesis is true, where probability means
the fraction of time the data would have come out the way it did in many
repeated trials. So the statement means something like: if the true mean
mass *M* = *a*, we would have expected to obtain a sample mean in the
range *a* ± *b* 68% of
the time. This is the fundamental conceptual difference between Bayesian
and frequentist statistics.
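As a concrete illustration of the frequentist statement, here is a minimal Python sketch (all sample values below are made up) that computes a 68% confidence interval for a sample mean. With a uniform prior and known scatter, a Bayesian analysis gives a numerically identical interval, but with the different interpretation described above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample: masses (in solar masses) of N G stars, drawn from a
# population with assumed true mean 0.95 and scatter 0.08 (illustrative values).
true_mean, scatter, N = 0.95, 0.08, 25
masses = rng.normal(true_mean, scatter, N)

# Frequentist 68% confidence interval for the mean: sample mean +/- standard error.
a = masses.mean()
b = masses.std(ddof=1) / np.sqrt(N)
print(f"68% confidence interval: {a:.3f} +/- {b:.3f}")

# With a uniform prior on the mean (and known scatter), the Bayesian posterior
# for the mean is a Gaussian centered on the sample mean with the same width,
# so the 68% credible interval is numerically the same; only the
# interpretation differs.
```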

Bayesian: Evaluate the probability of a hypothesis in light of data (and prior information). Parameter values or the probability of truth of a hypothesis are random variables; the data are not (though they are drawn from a pdf).

Frequentist: Evaluate the probability of obtaining the data — more precisely, the fraction of times a given statistic (such as the sample mean) applied to the data would come out the way it did in many repeated trials — given the hypothesis, or parameter values. The data are random variables; parameter values or the truth of hypotheses are not.

An opinion: The Bayesian formulation corresponds better to the way scientists actually think about probability, hypotheses, and data. It provides a better conceptual basis for figuring out what to do in a case where a standard recipe does not neatly apply. But frequentist methods sometimes seem easier to apply, and they clearly capture some of our intuition about probability.

Criticism of the Bayesian approach: the incorporation of priors makes Bayesian methods seem subjective, and this is the main source of criticism. If the data are compelling and the prior is broad, then the prior doesn’t have much effect on the posterior. But if the data are weak, or the prior is narrow, then it can have a big effect. Sometimes there are well-defined ways of assigning an “uninformative” prior, but sometimes there is genuine ambiguity. Bayesian methods can also seem like a lot of work to get to a straightforward answer. In particular, we sometimes want to carry out an “absolute” hypothesis test without having to enumerate all alternative hypotheses.

Criticism of the frequentist approach: it doesn’t correspond as well to scientific intuition; we want to talk about the probability of hypotheses or parameter values. The choice of which statistical test to apply is often arbitrary, and there is not a clear way to go from the result of a test to an actual scientific inference about parameter values or the validity of a hypothesis. Bayesians argue that frequentist methods obtain the appearance of objectivity only by sweeping priors under the rug, making assumptions implicit rather than explicit.

The frequentist approach relies on hypothetical data as well as the actual data obtained, and the choice of hypothetical data sets is often ambiguous, e.g., in the “stopping” problem. Sometimes we do have good prior information; it is straightforward to incorporate this in a Bayesian approach, but not in a frequentist one. Frequentist methods are poorly equipped to handle “nuisance parameters,” which in the Bayesian approach are easily handled by marginalization. For example, the marginal distribution of a parameter *x*,

*p*(*x*) = ∫ *p*(*x*, *a*, *b*, *c*) d*a* d*b* d*c*,

can only exist if *x* is a random variable.
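A minimal sketch of marginalization on a grid, using an illustrative two-parameter Gaussian posterior (not real data): the nuisance parameter is removed simply by summing the normalized posterior along its axis.

```python
import numpy as np

# Toy posterior on a grid: parameter of interest x, nuisance parameter a.
# Assume a correlated bivariate Gaussian posterior (illustrative only).
x = np.linspace(-4, 4, 201)
a = np.linspace(-4, 4, 201)
X, A = np.meshgrid(x, a, indexing="ij")
rho = 0.6
post = np.exp(-0.5 * (X**2 + A**2 - 2 * rho * X * A) / (1 - rho**2))
post /= post.sum()  # normalize on the grid

# Marginalize over the nuisance parameter: sum (integrate) along the a-axis,
# leaving the marginal distribution p(x).
p_x = post.sum(axis=1)
print("peak of marginal p(x) at x =", x[np.argmax(p_x)])
```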

In practice, a frequentist analysis yields parameters and their uncertainties, while a Bayesian analysis yields probability distribution functions of the parameters. The latter is often more computationally intensive to calculate. Bayesian analysis also includes explicit priors.

For more detailed discussion, see http://arxiv.org/abs/1411.5018, https://en.wikipedia.org/wiki/Bayesian_probability, and Tom Loredo's Bayesian Reprints.

In the context of Bayesian analysis, we can talk about the probability of models, and from Bayes' theorem we have:

*P*(*M*| *D*) = *P*(*D*| *M*) *P*(*M*)/*P*(*D*)

More accurately, we might want to talk about the probability of the parameters *θ* of a given model, i.e.

*P*(*θ*| *M*, *D*) = *P*(*D*| *θ*, *M*) *P*(*θ*| *M*)/*P*(*D*| *M*)

*P*(*M*| *D*) is called the *posterior probability*. This gives
the probability distribution of the model or model parameters.

*P*(*D*| *M*) is the *likelihood*, the probability that we would observe
the data given a model.
Calculation of the likelihood, *P*(*D*| *M*), is sometimes
straightforward, sometimes difficult. The background information may
specify assumptions such as a Gaussian error distribution or independence
of the data points. An important aspect of the Bayesian approach is that
only the actual data enter, not hypothetical data that could have been
taken. All the evidence of the data is contained in the likelihood.
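For the common case of independent data points with Gaussian uncertainties, the log-likelihood is just a sum over points. A minimal sketch (the constant model and the data values below are made up):

```python
import numpy as np

def log_likelihood(model, data, sigma):
    """ln P(D|M) for independent data points with Gaussian uncertainties.

    Each point contributes -0.5*((data - model)/sigma)**2 minus the
    normalization term ln(sqrt(2*pi)*sigma).
    """
    resid = (data - model) / sigma
    return np.sum(-0.5 * resid**2 - np.log(np.sqrt(2 * np.pi) * sigma))

# Hypothetical example: a constant model mu = 1.0 compared to three measurements.
data = np.array([1.1, 0.9, 1.3])
sigma = np.array([0.2, 0.2, 0.2])
print(log_likelihood(1.0, data, sigma))
```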

*P*(*M*) is the *prior*. In the absence of any additional information,
this is taken to be equal for all models (i.e., a *noninformative*
prior). But what “equal” means when the model has parameters is
ambiguous: for example, a uniform prior is not invariant under a change
of variables, e.g., a prior that is uniform in the logarithm of a
variable is not uniform in the variable itself! The Bayesian would say
that even a ``non-informative" prior is itself a prior. In the
noninformative case, the Bayesian result is the same as the frequentist
maximum likelihood; in a Bayesian analysis, however, we'd calculate the
full probability distribution function for the model parameters (more
on that later). Note that a uniform prior over an infinite range of any
variable is improper, i.e., it does not normalize to unity. Such a prior
can cause problems for interpretation of the posterior, because the
posterior may then not be normalizable. If the posterior can be
normalized, one can consider not just the peak of the posterior PDF (the
most probable model) but also the mean, interquartile range, etc. When
the PDF is normalizable, Bayesian analysis often speaks of the
``credible region," which is comparable to the ``confidence intervals"
used in frequentist analysis.
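A minimal sketch of working with a normalized posterior on a grid: with an illustrative Gaussian posterior for a parameter *μ* (all numbers made up), we can read off the peak, the mean, and a central 68% credible interval from the cumulative distribution.

```python
import numpy as np

# Posterior for a parameter mu on a grid, with a flat prior, so the posterior
# is proportional to the likelihood (assumed Gaussian here, center 0.3, width 0.5).
mu = np.linspace(-2, 2, 4001)
dx = mu[1] - mu[0]
post = np.exp(-0.5 * ((mu - 0.3) / 0.5) ** 2)  # unnormalized
post /= post.sum() * dx                        # normalize to unit integral

# Posterior summaries: peak, mean, and a central 68% credible interval
# (the 16th to 84th percentiles of the cumulative distribution).
peak = mu[np.argmax(post)]
mean = (mu * post).sum() * dx
cdf = np.cumsum(post) * dx
lo, hi = np.interp([0.16, 0.84], cdf, mu)
print(peak, mean, (lo, hi))
```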

*P*(*D*) is the *evidence*. In many cases, this can just be viewed
as a normalization constant that makes the posterior sum to unity. However,
it can be important if we consider the comparison of different models,
where we might distinguish between the *parameters* of one model
vs. those of another; in this case we might write
*P*(*θ*| *M*, *D*), etc.,
where *θ* are the parameters of model *M*.
The global likelihood of the data given the model, *P*(*D*| *M*), is then
the sum (or integral) of the likelihood times the prior over all
parameter values. Often *P*(*D*| *M*)
doesn’t matter: in comparing parameter values within a model, it cancels
out. When needed, it can often be found by requiring that *P*(*M*| *D*)
integrate (or sum) to one, as it must if it is a true probability.
The Bayesian approach forces specification of alternatives to evaluate
hypotheses; frequentist assessment tends to do this implicitly via the
choice of statistical test.
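A toy sketch of the evidence as the integral of likelihood times prior over a model's parameters (all numbers below are illustrative): a model with a broad free parameter pays an automatic Occam penalty relative to a model with none.

```python
import numpy as np

# One Gaussian datum with a made-up value and uncertainty.
datum, sigma = 1.0, 0.5

# Model 1: a fixed value of 0 (no free parameters), so the evidence is
# just the likelihood of the datum under that value.
ev1 = np.exp(-0.5 * ((datum - 0.0) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Model 2: a free offset mu with a uniform prior on [-5, 5]; the evidence
# is the integral of likelihood * prior over mu, done here on a grid.
mu = np.linspace(-5, 5, 2001)
like = np.exp(-0.5 * ((datum - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
prior = np.full_like(mu, 1 / 10)  # uniform prior density over a width of 10
ev2 = np.sum(like * prior) * (mu[1] - mu[0])

# The broader model spreads its prior over values the data rule out, so its
# evidence is diluted relative to the sharp model's.
print("evidence ratio M1/M2:", ev1 / ev2)
```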

A simple example of where one might see the value of the Bayesian approach: stellar parallaxes with uncertainties. See Bailer-Jones 3.5:

Consider the situation where we have a measurement of a parallax: how do you get the distance? Now, however, consider that the measurement has some uncertainty, so we want to determine the range of distances that are consistent with the observation.

An initial approach might be to give the most probable distance, *d* = 1/*p*, and estimate its uncertainty by propagation of uncertainties:

*σ*_{d} ≈ *σ*_{p}/*p*^{2}

However, it's clear that if the uncertainty in *p* is Gaussian, the
uncertainty in *d* is not. Furthermore, for the most distant objects, it's
not impossible that one would measure a negative parallax: what do you
do with that, recognizing that it does contain some information?

Bailer-Jones Fig 3.4 demonstrates the situation: the real uncertainties are not Gaussian, and you can even end up with negative distances!
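A quick Monte Carlo (with an illustrative parallax and uncertainty, not from any catalog) makes the point: the distribution of *d* = 1/*p* is strongly non-Gaussian, and a small fraction of simulated measurements even come out negative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical star: true parallax 2 mas (distance 0.5 kpc), measured with
# a Gaussian uncertainty of 0.7 mas (a 35% fractional error, illustrative).
p_true, sigma_p = 2.0, 0.7
p_obs = rng.normal(p_true, sigma_p, 100_000)

# Some simulated measurements come out negative: allowed by the Gaussian
# error distribution even though a true parallax cannot be negative.
print("fraction of negative parallaxes:", (p_obs < 0).mean())

# Naive distances d = 1/p (kpc for p in mas): the distribution is skewed,
# non-Gaussian, and contains negative "distances" from negative parallaxes.
d = 1.0 / p_obs
print("median d:", np.median(d), " vs 1/p_true:", 1.0 / p_true)
```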

Posing this as a Bayesian inference problem allows an alternate path.

If we use a uniform prior (all distances are equally likely), this is
an improper prior. In this case, we can't normalize the posterior
probability distribution function, so all we can really use from it is
the location of its maximum, which is at *d* = 1/*p*.

But a uniform prior is not realistic! For one, we know we can't have
negative distances, so at a bare minimum, one wants to set *P*(*d*) = 0 for *d* < 0.
But this doesn't help much, as we still have an unnormalizable
posterior. But consider that there must be some selection function to
enable a parallax measurement at all, e.g., a star must be brighter than
some limit for it to be detected and measured. Given that we know that
stars have a maximum intrinsic luminosity, this yields a maximum distance.
Furthermore, we know that stars are not uniformly distributed in space.
Finally, the volume element for a star grows as *d*^{2}.
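These considerations can be sketched numerically. Below, an assumed parallax measurement (illustrative values) is combined with a *d*^{2} volume prior truncated at an assumed maximum distance; the posterior mode lands at a slightly larger distance than the naive 1/*p* estimate.

```python
import numpy as np

# Posterior sketch: P(d|p) ∝ exp(-(p - 1/d)^2 / (2 sigma_p^2)) * d^2
# for 0 < d < d_lim. Parallax in mas, distance in kpc; all values illustrative.
p_obs, sigma_p, d_lim = 2.0, 0.3, 2.0

d = np.linspace(1e-3, d_lim, 5000)
likelihood = np.exp(-0.5 * ((p_obs - 1.0 / d) / sigma_p) ** 2)
prior = d**2                         # uniform space density: volume grows as d^2
post = likelihood * prior
post /= post.sum() * (d[1] - d[0])   # normalizable thanks to the distance cutoff

# The volume prior pulls the posterior mode to a larger distance than 1/p.
print("naive 1/p:", 1.0 / p_obs, " posterior mode:", d[np.argmax(post)])
```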

For the case of a uniform prior in *r* but with a maximum distance, *r*_{lim},
see Bailer-Jones 3.6.