
Statistical inference

Let us turn to statistical inference, the process of drawing conclusions from data. This is often done in the context of fitting a model to data.

Overview: frequentist vs. Bayesian

Given a set of observations/data, one often wants to summarize and get at underlying physics by fitting some sort of model to the data. The model might be an empirical model or it might be motivated by some underlying theory. In many cases, the model is parametric: there is some number of parameters that specify a particular model out of a given class.

The general scheme is to define some merit function that quantifies the quality of a particular model fit, and to choose the parameters that provide the best match, e.g., a minimum deviation between model and data, or a maximum probability that a given set of parameters produced the data. In doing this one also wants to understand how reliable the derived parameters are, and how good the model fit actually is, i.e., to what extent it is consistent with one's understanding of the uncertainties on the data.
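To make this concrete, here is a minimal sketch in Python, assuming Gaussian uncertainties and a simple straight-line model; the data values are invented, and scipy's curve_fit stands in for whichever optimizer one actually uses. It minimizes a chi-square merit function, returns best-fit parameters with uncertainty estimates, and reports the goodness of fit.

\begin{verbatim}
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: x values, measurements y, and Gaussian uncertainties sigma
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma = np.full_like(y, 0.3)

# Parametric model: a straight line specified by (slope, intercept)
def model(x, slope, intercept):
    return slope * x + intercept

# curve_fit minimizes the chi-square merit function
#   chi2 = sum( (y - model)^2 / sigma^2 )
# and returns the best-fit parameters plus their covariance matrix,
# which provides an estimate of the parameter uncertainties.
popt, pcov = curve_fit(model, x, y, sigma=sigma, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))
print("best-fit parameters:", popt)
print("parameter uncertainties:", perr)

# Goodness of fit: chi-square of the best model, to be compared with the
# number of degrees of freedom (N data points - N parameters)
chi2 = np.sum(((y - model(x, *popt)) / sigma) ** 2)
print("chi2 =", chi2, "for", len(x) - len(popt), "degrees of freedom")
\end{verbatim}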

There are two different ``schools" about how to think about model fitting: frequentism and Bayesian. In the frequentist approach (sometimes called the classical approach), one considers how frequently one might expect to observe a given data set given some underlying model (i.e., P(D| M), the probability of observing a data set given a model); if very infrequently, then the model is rejected. The model that ``matches" the observed data most frequently (based on some calculated statistic) is viewed as the correct underlying model, and as such, gives the best parameters, along with some estimate of their uncertainty. Here, uncertainty refers to the range of different observed data sets that are consistent with a given model.

The Bayesian approach considers the probability that a given model is correct given a set of data (i.e., P(M|D), the probability of a model given the data). It allows for the possibility that there is external information that may prefer one model over another, and this is incorporated into the analysis as a prior. Because it considers the probability of different models, it deals in probability distribution functions of parameters. Examples of priors: fitting a Hess diagram with a combination of SSPs, with external constraints on the allowed ages; fitting UTR data at low count rates in the presence of readout noise.

The frequentist paradigm was dominant in astronomy until fairly recently, but the Bayesian paradigm has become increasingly widespread. In many cases the two approaches give the same result, though with somewhat different interpretations; in other cases the results can differ.

Bayesian vs. frequentist statistics

Suppose we have measured the mean mass of a sample of G stars, by some method, and say: at the 68% confidence level, the mean mass of G stars is a ± b. What does this statement mean?

Bayesian answer: There is some true mean mass, M, of G stars, and there is a 68% probability that a − b < M < a + b. More pedantically: the hypothesis that the true mean mass, M, of G stars lies in the range a − b to a + b has a 68% probability of being true. The probability of the hypothesis is a real-numbered expression of the degree of belief we should have in the hypothesis, and it obeys the axioms of probability theory.

In “classical” or “frequentist” statistics, a probability is a statement about the frequency of outcomes in many repeated trials. With this restricted definition, one can't refer to the probability of a hypothesis: it is either true or false. One can refer to the probability of the data if a hypothesis is true, where probability means the fraction of the time the data would have come out the way they did in many repeated trials. So the statement means something like: if the true mean mass M = a, we would expect to obtain a sample mean in the range a ± b 68% of the time. This is the fundamental conceptual difference between Bayesian and frequentist statistics.
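To see the frequentist statement in action, here is a minimal simulation sketch (the true mean, scatter, and sample size are invented): draw many hypothetical samples from a distribution with a known mean and count how often the interval (sample mean ± standard error) contains that mean.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(42)

M_true = 1.0      # hypothetical true mean mass of G stars (solar masses)
scatter = 0.1     # hypothetical star-to-star scatter
nstar = 25        # stars per sample
ntrial = 10000    # number of repeated hypothetical experiments

covered = 0
for _ in range(ntrial):
    sample = rng.normal(M_true, scatter, nstar)
    a = sample.mean()                          # sample mean
    b = sample.std(ddof=1) / np.sqrt(nstar)    # standard error of the mean
    # does the interval a +/- b contain the true mean?
    if a - b < M_true < a + b:
        covered += 1

# should come out near 0.68: the sense in which this is a
# "68% confidence" interval in the frequentist picture
print("fraction of intervals containing M_true:", covered / ntrial)
\end{verbatim}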

Bayesian: Evaluate the probability of a hypothesis in light of data (and prior information). Parameter values or probability of truth of a hypothesis are random variables, data are not (though they are drawn from a pdf). Frequentist: Evaluate the probability of obtaining the data — more precisely, the fraction of times a given statistic (such as the sample mean) applied to the data would come out the way it did in many repeated trials — given the hypothesis, or parameter values. Data are random variables, parameter values or truth of hypotheses are not.

An opinion: The Bayesian formulation corresponds better to the way scientists actually think about probability, hypotheses, and data. It provides a better conceptual basis for figuring out what to do in a case where a standard recipe does not neatly apply. But frequentist methods sometimes seem easier to apply, and they clearly capture some of our intuition about probability.

Criticism of Bayesian approach: the incorporation of priors makes Bayesian methods seem subjective, and this is the main source of criticism of the Bayesian approach. If the data are compelling and the prior is broad, then the prior doesn't have much effect on the posterior. But if the data are weak, or the prior is narrow, then it can have a big effect. Sometimes there are well-defined ways of assigning an “uninformative” prior, but sometimes there is genuine ambiguity. Bayesian methods sometimes seem like a lot of work to get to a straightforward answer. In particular, we sometimes want to carry out an “absolute” hypothesis test without having to enumerate all alternative hypotheses.

Criticism of frequentist approach: doesn’t correspond as well to scientific intuition. We want to talk about the probability of hypotheses or parameter values. The choice of which statistical test to apply is often arbitrary. There is not a clear way to go from the result of a test to an actual scientific inference about parameter values or validity of a hypothesis. Bayesians argue that frequentist methods obtain the appearance of objectivity only by sweeping priors under the rug, making assumptions implicit rather than explicit.

The frequentist approach relies on hypothetical data as well as the actual data obtained, and the choice of hypothetical data sets is often ambiguous, e.g., in the “stopping” problem. Sometimes we do have good prior information; it is straightforward to incorporate this in a Bayesian approach, but not in a frequentist one. Frequentist methods are poorly equipped to handle “nuisance parameters,” which in the Bayesian approach are easily handled by marginalization. For example, the marginal distribution of a parameter x,

\[ p(x) = \int p(x \vert a, b, c)\, p(a, b, c)\, da\, db\, dc , \]

can only exist if x is a random variable.
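As a minimal numerical sketch of marginalization (the joint posterior here is an invented correlated Gaussian, and the nuisance parameters are collapsed to a single parameter b for brevity), one can integrate the joint posterior over the nuisance parameter on a grid:

\begin{verbatim}
import numpy as np

# Grids for the parameter of interest x and a nuisance parameter b
x = np.linspace(-5, 5, 401)
b = np.linspace(-5, 5, 401)
dx = x[1] - x[0]
db = b[1] - b[0]
X, B = np.meshgrid(x, b, indexing="ij")

# Hypothetical joint posterior p(x, b | D): a correlated 2D Gaussian
rho = 0.8
joint = np.exp(-0.5 * (X**2 - 2 * rho * X * B + B**2) / (1 - rho**2))

# Marginalize over the nuisance parameter: p(x | D) = integral p(x, b | D) db
p_x = joint.sum(axis=1) * db
p_x /= p_x.sum() * dx     # normalize so the marginal integrates to one

print("marginal mean of x:", (x * p_x).sum() * dx)
\end{verbatim}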

In practice, frequentist analysis yields parameters and their uncertainties, while Bayesian analysis yields probability distribution functions of the parameters. The latter are often more computationally intensive to calculate. Bayesian analysis also includes explicit priors.

For some more detailed discussion, see http://arxiv.org/abs/1411.5018, https://en.wikipedia.org/wiki/Bayesian_probability, Tom Loredo's Bayesian Reprints

In the context of Bayesian analysis, we can talk about the probability of models, and we have:

\[ P(M \vert D) = \frac{P(D \vert M)\, P(M)}{P(D)} \]

More accurately, we might want to talk about the probability of parameters of a given model, i.e.

\[ P(\theta \vert D, M) = \frac{P(D \vert \theta, M)\, P(\theta \vert M)}{P(D \vert M)} \]

where we might also consider different models with different parameters.

P(M| D) is called the posterior probability. This gives the probability distribution of the model or model parameters.

P(D|M) is the likelihood, the probability that we would observe the data given a model. Calculation of the likelihood is sometimes straightforward, sometimes difficult. The background information may specify assumptions such as a Gaussian error distribution or independence of the data points. An important aspect of the Bayesian approach is that only the actual data enter, not hypothetical data that could have been taken: all of the evidence of the data is contained in the likelihood.

P(M) is the prior. In the absence of any additional information, it is the same for all models (i.e., a noninformative prior). But what ``the same'' means when the model has parameters is ambiguous: a uniform prior is not invariant under a change of variables, e.g., a prior that is uniform in the logarithm of a variable is not uniform in the variable itself! The Bayesian would say that even a ``non-informative'' prior is still a prior. With a noninformative prior, the peak of the Bayesian posterior coincides with the frequentist maximum likelihood result; in the Bayesian analysis, however, we would calculate the full probability distribution function for the model parameters. More on that later. Note that a prior that is uniform over an infinite range is an improper probability, i.e., it does not normalize to unity. Such a prior can cause problems for interpretation of the posterior, because the posterior may then not be normalizable. If the posterior can be normalized, one can consider not just the peak of the posterior PDF (the most probable model), but also the mean, interquartile range, etc. When the PDF is normalizable, Bayesian analysis often talks about the ``credible region,'' which is comparable to the ``confidence interval'' used in frequentist analysis.
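To make the change-of-variables point explicit, the standard transformation rule for probability densities gives, for a prior that is flat in \(\log_{10} x\),

\[ p(x) = p(\log_{10} x) \left\vert \frac{d \log_{10} x}{dx} \right\vert \propto \frac{1}{x} , \]

which is clearly not flat in x itself.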

P(D) is the evidence. In many cases, this can be viewed simply as a normalization constant that makes the posterior sum to unity. However, it becomes important if we compare different models, where we might distinguish between the parameters of one model and those of another; in this case we might write P(θ| M, D), etc., where θ are the parameters of model M. The global likelihood of the data given a model, P(D| M), is then the sum (or integral) over ``all'' hypotheses, i.e., over the parameter values of the model weighted by their prior. Often the evidence doesn't matter: in comparing hypotheses or parameter values, it cancels out. When needed, it can often be found by requiring that P(M| D) integrate (or sum) to one, as it must if it is a true probability. The Bayesian approach forces explicit specification of alternatives in order to evaluate hypotheses; frequentist assessment tends to do this implicitly via the choice of statistical test.
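Tying these pieces together, here is a minimal grid-based sketch (the data values, the Gaussian error model, and the flat prior range are all assumptions made for illustration): the posterior is the likelihood times the prior, and the evidence is the normalization that makes the posterior integrate to one.

\begin{verbatim}
import numpy as np

# Hypothetical data: N measurements with known Gaussian uncertainty
data = np.array([4.8, 5.3, 4.9, 5.1, 5.4])
sigma = 0.3

# Grid of the model parameter (here, the true mean mu)
mu = np.linspace(3.0, 7.0, 1000)
dmu = mu[1] - mu[0]

# Likelihood P(D | mu): product of Gaussians, computed in log for stability
loglike = np.array([-0.5 * np.sum((data - m) ** 2) / sigma**2 for m in mu])
like = np.exp(loglike - loglike.max())   # constant factors cancel below

# Prior P(mu): flat (noninformative) over the grid range
prior = np.ones_like(mu)

# Evidence P(D): the normalization of likelihood x prior on this grid
evidence = np.sum(like * prior) * dmu

# Posterior P(mu | D) = P(D | mu) P(mu) / P(D)
posterior = like * prior / evidence

print("posterior peak:", mu[np.argmax(posterior)])
print("posterior mean:", np.sum(mu * posterior) * dmu)
\end{verbatim}

Because the prior is flat, the peak of this posterior coincides with the maximum likelihood estimate (the sample mean), but the full PDF also provides the mean, credible intervals, etc.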


\begin{shaded}
\textit{
Understand the basic conceptual differences between a frequentist and a Bayesian analysis. Know Bayes' theorem. Understand the practical differences.
}
\end{shaded}

A simple example of where one might see the value of the Bayesian approach: stellar parallaxes with uncertainties. See Bailer-Jones 3.5:

Consider the situation where we have a measurement of a parallax: how do you get the distance? Now, however, consider that the measurement has some uncertainty, so we want to determine the range of distances that are consistent with the observation.

An initial approach might be to give the most probable distance, and estimate its uncertainty by propagation of uncertainties:

\[ d = \frac{1}{p} \]

\[ \sigma_d^2 = \left(\frac{1}{p^2}\right)^2 \sigma_p^2 \]

However, it's clear that if the uncertainty in p is Gaussian, the uncertainty in d is not. Furthermore, for the most distant objects, it's not impossible that one would measure a negative parallax: what do you do with that, recognizing that it does contain some information?

Bailer-Jones Fig 3.4 demonstrates the situation: the real uncertainties are not Gaussian. And even with the real uncertainties, you can end up with negative distances!
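A quick Monte Carlo sketch makes the same point (the true parallax and its uncertainty are invented): draw noisy parallaxes from a Gaussian and look at the implied distances d = 1/p.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

p_true = 2.0     # hypothetical true parallax (mas), i.e., d = 0.5 kpc
sigma_p = 1.0    # 50% fractional uncertainty, typical of a marginal detection

# Many hypothetical measurements of the same star's parallax
p_obs = rng.normal(p_true, sigma_p, 100000)
print("fraction of negative measured parallaxes:", np.mean(p_obs < 0))

# Naive distances d = 1/p: strongly non-Gaussian even for p > 0
d = 1.0 / p_obs[p_obs > 0]
lo, med, hi = np.percentile(d, [16, 50, 84])
print("distance percentiles (16, 50, 84):", lo, med, hi, "kpc")
print("compare to 1/p_true =", 1.0 / p_true, "kpc")
\end{verbatim}

The asymmetric percentiles show that a symmetric Gaussian uncertainty in p does not translate into a symmetric Gaussian uncertainty in d, and a few percent of the measurements come out with negative parallax.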

Posing this as a Bayesian inference problem allows an alternate path.

\[ P(d \vert p) = \frac{P(p \vert d)\, P(d)}{P(p)} \]

If we use a uniform prior (all distances are equally likely), then this is an improper prior. In this case, we can't normalize the posterior probability distribution function, so all we can really use from it is the maximum of P(d| p), which occurs at d = 1/p.

Bailer-Jones 3.5

But a uniform prior is not realistic! For one, we know we can't have negative distances, so at a bare minimum one wants to set P(d) = 0 for d < 0. This doesn't help much, however, as we still have an unnormalizable posterior. But consider that there must be some selection function for a parallax measurement to be possible at all, e.g., a star must be brighter than some limit for it to be detected and measured. Since stars have a maximum intrinsic luminosity, this implies a maximum distance. Furthermore, we know that stars are not uniformly distributed in space. Finally, the volume element grows as d², so a uniform space density corresponds to more stars at larger distances.

For the case of a prior that is uniform in r but with a maximum distance r_lim, see Bailer-Jones 3.6.
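Here is a minimal grid sketch of that inference (the measured parallax, its uncertainty, and the distance limit r_lim are invented; the Gaussian measurement model for P(p|d) and the truncated uniform prior follow the discussion above).

\begin{verbatim}
import numpy as np

p_obs = 0.5      # measured parallax (mas); could even be negative
sigma_p = 0.4    # parallax uncertainty (mas)
r_lim = 10.0     # hypothetical maximum distance (kpc)

# Grid of distances; d <= 0 is excluded by construction
d = np.linspace(1e-3, r_lim, 10000)
dd = d[1] - d[0]

# Likelihood P(p | d): Gaussian measurement of the true parallax 1/d
like = np.exp(-0.5 * (p_obs - 1.0 / d) ** 2 / sigma_p**2)

# Prior P(d): uniform in distance out to r_lim, zero beyond (a proper prior)
prior = np.ones_like(d)

# Posterior P(d | p) = P(p | d) P(d) / P(p), normalized on the grid
post = like * prior
post /= post.sum() * dd

cdf = np.cumsum(post) * dd
print("posterior mode:", d[np.argmax(post)], "kpc")
print("posterior median:", d[np.searchsorted(cdf, 0.5)], "kpc")
print("68% credible interval:",
      d[np.searchsorted(cdf, 0.16)], "-", d[np.searchsorted(cdf, 0.84)], "kpc")
\end{verbatim}

With a proper (normalizable) prior, the posterior can be summarized by more than its peak: the median and a credible region are well defined, and they remain so even for low signal-to-noise or negative measured parallaxes.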

