Next: Bayesian inference Up: AY630 class notes Previous: Statistical inference

Subsections

Maximum likelihood and least squares

Let's start with an analysis of the likelihood, which is a common part of a frequentist analysis and also a component of the Bayesian analysis. Given a set of data and some model (with parameters), one considers the probability that the set of data will be observed if the model is correct, and we will prefer the set of parameters that maximizes this probability.

Consider the example of fitting a straight line to a set of data. How is the best fit line usually determined? Where does this come from? Consider that observed data points have some measurement uncertainty, σ. For now consider that all points have the same uncertainty, i.e. homoschedastic uncertainties, and that the uncertainties are distributed according to a Gaussian distribution.

What is the probability of observing a given data value from a known function y = ax + b, given an independent variable xi, and a known Gaussian distribution of uncertainty in y? What's the probability of observing a series of two independently drawn data values?

The probability of observing a series of N data points is, given some model y(xi| aj), where the aj represent a set of M parameters:

P(D| M)∝expΔy

where aj are some number of parameters of a model, and Δy is some small range in y.

Maximizing the probability is the same thing as maximizing the logarithm of the probability, which is the same thing as minimizing the negative of the logarithm, i.e. minimize:

-log P(D| M) = - Nlog(Δy) + const

If σ is the same for all points, then this is equivalent to minimizing

(yi - y(xi| aj))2

i.e., least squares.

Consider a simple application, multiple measurements of some quantity with the same uncertainty, σ, on all points (homoschedastic uncertainties), so the model is y(xi) = μ where we want to determine the most probable value of the quantity, i.e. μ. What's the answer?

Minimizing gives:

= 2(yi - μ) = 0

μ =

which is a familiar result! This validates the concept of using the moment calculation of the sample mean.

What about the uncertainty on the mean? Just use error propagation, for x = f (u, v,...),

σ(μ)2 = σy02 + σy12 + ...

σμ2 = =

σμ =

What about the case of unequal uncertainties on the data points, i.e. heteroschedastic uncertainties? Then

-log P(D| M) = + const

Here one is minimizing a quantity X2:

X2 =

which is often referred to as χ2, but perhaps incorrectly. An important thing about χ2 is that the probability of a given value of χ2 given N data points and M parameters can be calculated analytically, and is called the chi-square distribution for ν = N - M degrees of freedom; see, e.g. scipy.stats.chi2, e.g. cumulative density function (cdf), which you want to be in some range, e.g., 0.05-0.95. (Note, however, that there may be some question whether the X2 as defined above necessarily follows the statistical χ2 distribution, see discussion in Feigelson & Babu, also bottom of https://asaip.psu.edu/forums/astrostatistics-forum/811087908. The maximum likelihood part is still fine, it's just the evaluation of the quality using a χ2 test that might not be).

Often, people qualititatively judge the quality of a fit by the reduced χ2, or χ2 per degree of freedom:

χν2 =

which is expected to have a value near unity (but really, you should calculate the probability of χ2, since the spread in χν2 depends on ν (standard deviation is ).

It is important to recognize that this analysis depends on the assumption that the uncertainties are distributed according to a normal (Gaussian) distribution!

Using χ2 as a method for checking uncertainties, or even determining them (at the expense of being able to say whether your model is a good representation of the data!). See ML 4.1

Back to the case of unequal uncertainties on the data points. For our simple model of measuring a single quantity, we have:

= 2 = 0

- μ = 0

μ =

i.e., a weighted mean. Again, you can use error propagation to get (work not shown):

σμ =

Let's move on to a slightly more complicated example: fitting a straight line to a set of data. Here, the model is

y(xi) = a + bxi

where we now have two parameters (a and b) and χ2 is:

χ2 =

It should be clear that this is quadratic surface as a function of the two parameters, a and b.

Again, we want to minimize χ2, so we take the derivatives with respect to each of the parameters and set them to zero:

= 0 = - 2

= 0 = - 2

Separating out the sums, we have:

- a - b = 0

- a - b = 0

or

aS + bSx = Sy

aSx + bSxx = Sxy

where the various S are a shorthand for the sums:

S =

Sx =

Sxy =

Sxx =

Sy =

which is a set of two equations with two unknowns. Note that you could equivalently write the two equations in matrix format:

=

where you can note that the square matrix is the sum of the product of derivatives with respect to each parameter, and the column vector is the sum of the product of the derivative with respect to each parameter times the observed value.

Solving algebraically, you get:

a =

b =

We also want the uncertainties in the parameters. Again, use propagation of errors to get (work not shown!):

σa2 =

σb2 =

Finally, we will want to calculate the probability of getting χ2 for our fit, in an effort to understand 1) if our uncertainties are not properly calculated and normally distributed, or 2) if our model is a poor model.

Note that there can be covariances between the parameters! Consider the case where the values are far from the origin. The covariance is given by :

σab = - σb2

It can be elimated by shifting the coordinate system so that = 0.

General linear fits

We can generalize the least squares idea to any model that is some linear combination of terms that are a function of the independent variable, e.g.

y = a0 + a1f1(x) + a2f2(x) + a3f3(x) + ...

such a model is called a linear model, because it is linear in the parameters (but not at all necessarily linear in the independent variable!). The model could be a polynomial of arbitrary order, but could also include trignometric functions, etc. We write the model in a simple form:

y(x) = akfk(x)

where there are M parameters. We will still have N data points and, as you might imagine, we require N > M.

We can write the merit function as:

X2 =

Then, minimizing X2 leads to the set of M equations:

0 = yi - ajfj(xi)fk(xi)

where k = 0,...., M - 1. Note that the fk(x) are the derivatives of the function y(x) with respect to each of the k parameters.

Separating the terms and interchanging the order of the sums, we get:

=

Define:

αjk =

βj =

then we have the set of equations:

αjkaj = βk

for k = 0,....M - 1.

The form of these equations can be simplified even further by casting them in terms of the design matrix, A, which consists of N measurements of M terms:

Aij =

with N rows and M columns. Along with the definition:

bi =

we have:

α = ATA

which is an M by M matrix,

β = ATb

which is a vector of length M, and

ATAaj = ATb

where the dots are for the matrix operation that produces the sums. This is just another notation for the same thing, introduced here in case you run across this language or formalism.

In fact, this can be even further generalized to allow for covariances, σij in the data:

Aij = fj(xi)

Cii = σi2

Cij = σij

bi = yi

ATC-1AaJ = ATC-1b

For a given data set, we can calculate α and β either by doing the sums or by setting up the design matrix and using matrix arithmetic. We then need to solve the set of equations for ak.

Solving linear equations

This is just a linear algebra problem, and there are well-developed techniques (see Numerical Recipes chapter 2). Most simply, we just invert the matrix α to get

ak = αjk-1βk

or, with the design matrix formalism

ak = (ATA)-1(ATB)

A simple algorithm for inverting a matrix is called Gauss-Jordan elimination, but there are others. In particular, NR recommends the use of singular value decomposition for solving all but the simplest least-squares problems; this is especially important if your problem is nearly singular, i.e. where two or more of the equations may not be totally independent of each other: fully singular problems should be recognized and redefined, but it is possible to have non-singular problems encounter singularity under some conditions depending on how the data are sampled. See NR 15.4.2 and Chapter 2!

As an aside, note that it is possible to solve a linear set of equations significantly faster than inverting the matrix through various types of matrix decomposition, e.g., LU decomposition and Cholesky decomposition (again, see NR Chapter 2). There may be linear algebra problems that you encounter that only require the solution of the equations and not the inverse matrix. However, for the least squares problem, we often actually do want the inverse matrix C = α-1, because its elements have meaning. Propagation of errors gives:

σ2(ak) = Cjj

i.e., the diagonal elements of the inverse matrix. The off-diagonal elements give the covariances between the parameters, and the inverse matrix, C, is called the covariance matrix.

A Python implementation of matrix inversion and linear algebra in general can be found in scipy.linalg; linear least squares is implemented in scipy.linalg.leastsq(), but is really just an implementation of the matrix formulation.

For the specific linear case of a polynomial fit, you can use numpy.polyfit(). For the formulation of a design matrix for a polynomial fit, see numpy.vander().

Linear least squares is also implemented in astropy in the modeling module using the LinearLSQFitter class (which uses numpy.linalg); astropy has a number of standard models that you might use, or you can provide your own custom model.

from astropy.modeling import models, fitting
fit_p = fitting.LinearLSQFitter()
p_init = models.Polynomial1D(degree=degree)
pfit = fit_p(p_init, x, y)

Astropy has a number of common models, but you can also define your own model using models.custom_model, by supplying a function that returns values, and a function that returns derivatives with respect to the parameters.

Let's do it. Simulate a data set, fit a line, and output parameters, parameter uncertainties, chi2, and probability of chi2. See leastsquares.ipynb

Linear problems with lots of parameters

Not all linear problems are just functional fits with small numbers of parameters A good example is spectral extraction for a multiobject fiber spectrograph. How is this problem formulated in the context of least squares? What does the α matrix look like? What needs to be solved? Note spectro-perfectionism example!

Note that for problems with large numbers of parameters, the linear algebra can become very computationally expensive. In many cases, however, the matrix that represents these problems may only be sparsely populated, i.e. so-called sparse matrices (see Numerical Recipes 2.7, Figure 2.7.1).

In such cases, there are methods that allow the equations to be solved significantly more efficiently. See LAPACK, etc.

Nonlinear fits

However, not all fits are linear in the parameters .... which of the following are not?

• y = a0x + a1e-x
• y = a0(1 + a1x)
• y = a0sin(x - a1)
• y = a0sin2x
• y = a0e(-(x-a1)2/a22)

In a linear fit, the χ2 surface is parabolic in parameter space, with a single minimum, and linear least squares can be used to determine the location of the minimum. In the non-linear case, however, the χ2 surface can be considerably more complex, and there is a (good) possibility that there are multiple minima, especially when there are multiple parameters, so one needs to be concerned about finding the global minimum and not just a local minimum; see here for some examples. Because of the complexity of the surface, the best fit can't be found with a simple calculation as we could for the linear fit.

A conceptually simple approach would be a grid search, where one simply tries all combinations of parameters and finds the one with the lowest χ2. This can extremely computationally expensive, especially for problems with more than a few parameters! One is also forced to decide on a step size in the grid, although you might imagine a successively refined grid as you proceed.

Better approaches attempt to use the χ2 values to find the minimum more efficiently. They can generally be split into two classes: those that use the derivative of the function and those that don't. If you can calculate derivatives of χ2 with respect to your parameters, then this provides information about how far you can move in parameter space towards the minimum. If you can't calculate derivatives, you can evaluate χ2 at several different locations, and use these values to try to work your way towards the minimum.

Around the final minimum, the χ2 surface can be approximated as a parabola, and it is possible to correct the solution to the minimum solution if one can arrive at a set of parameters near to the final minimum, where the parabolic approximation is good. Instead of solving for the parameters, one solves for corrections to a set of initial parameters, where the data matrix is now the different between the observed data and the model with the set of initial parameters. For details, see NR 15.5 or Robinson Chapter 6.

One arrives at a set of equations closely parallel to the equations for the linear case:

αklδal = δβk

similar to what we had for the linear problem, but note that the data array ( δβ) is now the difference between the data and the current guess model, and we are calculating corrections to the current guess parameters rather than being able to solve directly for the parameters.

δβk = - =

αkl = = - (yi - y(xi| a))

The matrix αkl is known as the curvature matrix. In most implementations, the second derivative term is dropped because it is negligible near the minimum because the sum over the yi - y(xi| a) term should cancel; see NR 15.5.1 for a partial explanation. The standard approach then uses:

αkl =

To implement this, you choose a starting guess of parameters, solve for the corrections δal to be applied to a current set of parameters, and iterates until one arrives at a solution that does not change significantly.

Far from the solution, this method can be significantly off in providing a good correction, and, in fact, can even move parameters away from the correct solution. In this case, it may be necessary to simply head in the steepest downhill direction in χ2 space, which is known as the method of steepest descent. Note that while this sounds like a reasonable thing to do in all cases, it can be very inefficient in finding the final solution (see, e.g. NR Figure 10.8.1).

A common algorithm switches between the method of steepest descent and the parabolic approximation and is known as the Levenberg-Marquardt method. This is done by using a modified curvature matrix:

αklδal = βk

where

αjj′ = αjj(1 + λ)

αjk′ = αjk(forj≠k)

When λ is large, this gives a steepest descent method, with a larger λ corresponding to a smaller step; when small, a parabolic method. This leads to the following recipe:

1. choose starting guess of parameter vector (a) and caluculate χ2(a)
2. calculate correction using model with small λ, e.g., λ = 0.001, and evaluate χ2 at new point
3. if χ2(a + δa) > χ2(a), increase λ and try again
4. if χ2(a + δa) < χ2(a), decrease λ, update a, and start the next step
5. stop when convergence criteria are met
6. invert curvature matrix to get parameter uncertainties

This method, along with several others, is implemented in scipy.optimize.curve_fit(), where one provides input data and a model function.

The Levenberg-Marquardt method in also implemented in astropy in the modeling module using the LevMarLSQFitter which is used identically to the linear least squares fitter described above.

Two key issues in nonlinear least squares are finding a good starting guess and a good convergence criterion. You may want to consider multiple starting guesses to verify that you're not converging to a local minimum. For convergence, you can look at the amplitude of the change in parameters or the amplitude in the change of χ2.

Example: fitting the width and/or center of a gaussian profile

Example: crowded field photometry: given an observation of a field with mulitple stars that are sufficiently close to one another that light from different stars overlaps, how do we infer the brightnesses of individual stars? The basic idea is that if we can measure the shapes (PSFs) of an isolated star and make the assumption that all stars have the same shift, and we will also assume that we can identify distinct stars, but not their exact positions. Given these, we can formulate this as a parameterized inference problem. The observed data is the distribution of intensities across all of the pixels. What is the model?

Multivariate fits

For many problems, we may have data in more that one dimension, i.e. a multivariate problem, where we might want to look for relations between the different variables. In this case, rather than fitting a line to a paired set of points, we might want to fit a plane to a series of data points f (x, y), or even some higher number of dimensions.

Note that the general linear least squares formulation can be applied to problems with multiple independent variables, e.g., fitting a surface to a set of points; simply treat x as a vector of data, and the formulation is exactly the same.

Least squares with uncertainties on both axes

It is sometimes the case that one might want to look for a relationship between two variables, each of which has measurement uncertainty or intrinsic scatter. Ordinary least squares treats one variable as independent (and error-free) and one as dependent. Note Tully-Fisher example.

If you fit the model as x(y) rather than y(x) you do not in general get the same solution. There are several possible ways to deal with the situation. One is to do the forward and inverse fits and take the line that bisects them. Another is to fit for the minimum least-squares deviation perpendicular to the fitted relation (see Figure). See Feigelson and Babu 7.3.2. Note the discussion there about astronomers inventing a statistic ... which turns out to be OK in this case!

Alternatively, use a Bayesian approach ....

Uncertainties and the Cramer Rao bound

The matrix α that we constructed for the linear least squares problem is a special case of the Hessian matrix, which is defined as the second partial derivatives of the log-likelihood function with respect to the parameters of the model:

Hessianjk(a) =

In the case of a linear model, the second partial derivatives are just the product of the first partial derivatives. The Hessian gives information about the curvature of the likelihood surface.

The inverse of the Hessian is the covariance, or error, matrix, which gives the uncertainties on the parameters: the diagonal elements are σ2(aj), the the off-diagonal elements are the covariances.

The Fisher information matrix, I(a), is related to the Hessian: the observed information is the negative of the Hessian evaluated at the maximum likelihood estimate, and the expected information is the expectation value of the negative of the Hessian, i.e., the Hessian averaged over many data sets. The latter is important because it sets the minimum variance that one can expect for derived parameters. This is known as the Cramer-Rao bound: the minimum variance is the inverse of the expected information.

The basic idea is straighforward: if the curvature is large, then the peak of the likelihood can be determined to higher accuracy than if the curvature is smaller.

The Cramer-Rao bound seems to be coming up with increasing frequency in discussions and in the literature, often in the context of whether people are extracting maximal information from existing data, e.g., abundances and RV examples.

Scatter, parameter uncertainties and confidence limits

Fitting provides us a set of best-fit parameters, but, because of uncertainties and limited number of data points, these will not necessarily be the true parameters. One generally wants to understand how different the derived parameters might be from the true ones.

Scatter in the data may arise from measurement uncertainties, but also from intrinsic scatter in the relationship that is being determined. In the presence of (unknown) intrinsic scatter, the rms around the fit will always be larger than that expected from measurement uncertainty alone.

The covariance matrix gives information about the uncertainties on the derived parameters in the case of well-understood uncertainties that are strictly distributed according to a Gaussian distribution. These can be used to provide confidence levels on your parameters; see NR 15.6.5.

In some cases, however, you may not be in that situation. Given this, it is common to derive parameter uncertainties by numerical techniques.

A straightforward technique if you have a good understanding of your measurement uncertainties is to use Monte Carlo simulation. In this case, you simulate your data set multiple times, derive parameters for each simulated data set, and look at the range of fit parameters as compared with the input parameters. To be completely representative of your uncertainties, you would need to draw the simulated data set from the true distribution, but of course you don't know what that is (it's what you're trying to derive)! So we take the best-fit from the actual data as representative of the true distribution, and hope that the difference between the derived parameters from the simulated sets and the input parameters is representative of the difference between the actual data set and the true parameters. I believe this procedure is known by the terminology parametric bootstrap.

If you don't have a solid understanding of your data uncertainties, then Monte Carlo will not give an accurate representation of your parameter uncertainties. In this case, you can use multiple samples of your own data to get some estimate of the parameter uncertainties. A common technique is the non-parametric bootstrap technique, where, if you have N data points, you make multiple simulations using the same data, drawing N data points from the original data set – with replacement (i.e. same data point can be drawn more than once), derive parameters from multiple simulations, and look at the distribution of these parameters.

Another resampling technique is called jackknife resampling. Here, one removes a single data point from the data set each time, and recalculates the derived parameters, and looks at the distribution of these.

However, you may need to be careful about the interpretation of the confidence intervals determined by any of these techniques, which are based on a frequentist interpretation of the data. For these calculations, note that confidence intervals will change for different data sets. The frequentist interpretation is that the true value of the parameter will fall within the confidence levels at the frequency specified by your confidence interval. If you happen to have taken an unusual (infrequent) data set, the true parameters may not fall within the confidence levels derived from this data set!

There is some interesting discussion about the misapplication/misconception of the bootstrap method by astronomers in Laredo 2012 (which addresses several other misconceptions as well). The specific misconception here is variability = uncertainty"...

See Feigelson and Babu 3.6.2, 3.7.1, and 7.3.3 for additional discussion.

Minimization of other merit functions

Applicability of least-squares is limited to the case where the uncertainties are Gaussian. For other models, one wants to maximize the likelihood (or, usually, minimize -log(likelihood)). If the likelihood can be formulated analytically, this can be accomplished by numerical methods for minimizing a function, with respect to its parameters.

A minimizer without derivatives

For some problems, the model may be sufficiently complex that you cannot take the derivative of the model function with respect to the parameters. One example might be the case of fitting model spectra to an observed spectra. It is always possible to compute derivatives numerically and use these, however, this is not always the best solution, and one may want to consider methods of finding the minimum in parameter space that does not require derivatives.

A common minimization method that does not use derivatives is the downhill simplex" or Nelder-Mead algorithm. For this technique, the function is initially evaluated at M + 1 points (where M is the number of parameters). This makes an M-dimensional figure, called a simplex. The largest value is found, and a trial evaluation is made at a value reflected through the volume of the simplex. If this is smaller than the next-highest value of the original simplex, a value twice the distance is chosen and tested; if not, then a value half the distance is tested, until a better value is found. This point replaces the initial point, and the simplex has moved a step; if a better value can't be found, the entire simplex is contracted around the best point, and the process tries again. The process in then repeated until a convergence criterion is met. See NR Figure 10.5.1 and section 10.5. The simplex works its way down the function surface, expanding when it can take larger steps, and contracting when it needs to take smaller steps. Because of this behavior, it is also known as the amoeba" algorithm.

scipy routines for minimization

Note that, for methods that use derivatives, you can supply these analytically or you can have the routine calculate them numerically. The latter is generally computationally more expensive.

Note that I've just presented a few possibilities for minimizing non-linear functions. There are many more and a very large literature on the subject. If you find yourself having a non-trivial non-linear problem, you are highly advised to consider different possible approaches, as some may lead to significantly more efficient solutions than others!

See the minimization notebook

Genetic algorithms

Another set of methods that does not involve derivates are so-called genetic algorithms, see e.g. this article. The basic idea is to start with lots of different locations in parameter space, evaluate the likelihood, and then breed" the parameters of the best solutions with each other to find better solutions; see e.g, Charbonneau 1995, see also PIKAIA. The basic steps involve encoding the parameter information and then the operations of crossover and mutation; see here for some elaboration. These seem to be most useful for exploring parameter spaces that are very complex with multiple minima, and may provide an efficient way to find a global mininum. From what I read, it seems that they may not be particularly efficient in honing in on an accurate location for the global minimum once it is located.

A variant, differential evolution: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html

Next: Bayesian inference Up: AY630 class notes Previous: Statistical inference