Let's start with an analysis of the likelihood, which is a common part of a frequentist analysis and also a component of a Bayesian analysis. Given a set of data and some model (with parameters), one considers the probability that the set of data would be observed if the model were correct, and prefers the set of parameters that maximizes this probability.
Consider the example of fitting a straight line to a set of data. How is the best fit line usually determined? Where does this come from? Consider that observed data points have some measurement uncertainty, σ. For now, consider that all points have the same uncertainty, i.e. homoscedastic uncertainties, and that the uncertainties are distributed according to a Gaussian distribution.
What is the probability of observing a given data value from a known function y = ax + b, given an independent variable x_{i}, and a known Gaussian distribution of uncertainty in y? What's the probability of observing a series of two independently drawn data values?
Given some model y(x_{i}| a_{j}), where the a_{j} represent a set of M parameters, the probability of observing a series of N data points is:
Maximizing the probability is the same thing as maximizing the logarithm of the probability, which is the same thing as minimizing the negative of the logarithm, i.e. minimize:
If σ is the same for all points, then this is equivalent to minimizing
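For reference, the standard expressions implied here, in the notation of the text (model y(x_{i}| a), Gaussian uncertainties σ_{i}):

```latex
P = \prod_{i=1}^{N} \frac{1}{\sigma_i \sqrt{2\pi}}
    \exp\left[ -\frac{\left( y_i - y(x_i \,|\, a) \right)^2}{2\sigma_i^2} \right]
\quad\Longrightarrow\quad
-\ln P = \frac{1}{2} \sum_{i=1}^{N} \frac{\left( y_i - y(x_i \,|\, a) \right)^2}{\sigma_i^2}
         + \mathrm{const} \equiv \frac{\chi^2}{2} + \mathrm{const}
```

With all σ_{i} equal, minimizing χ² reduces to minimizing the sum of squared residuals, i.e. ordinary least squares.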
Consider a simple application: multiple measurements of some quantity with the same uncertainty, σ, on all points (homoscedastic uncertainties), so the model is y(x_{i}) = μ, where we want to determine the most probable value of the quantity, i.e. μ. What's the answer?
Minimizing gives:
What about the uncertainty on the mean? Just use error propagation, for x = f (u, v,...),
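For completeness, the standard results being referred to: minimizing χ² with respect to μ gives the sample mean, and the propagation formula applied to the mean gives the familiar standard error:

```latex
\frac{\partial}{\partial \mu} \sum_{i=1}^{N} \frac{(y_i - \mu)^2}{\sigma^2} = 0
\quad\Rightarrow\quad
\mu = \frac{1}{N} \sum_{i=1}^{N} y_i

\sigma_x^2 = \left( \frac{\partial f}{\partial u} \right)^2 \sigma_u^2
           + \left( \frac{\partial f}{\partial v} \right)^2 \sigma_v^2 + \cdots
\quad\Rightarrow\quad
\sigma_\mu^2 = N \left( \frac{1}{N} \right)^2 \sigma^2 = \frac{\sigma^2}{N}
```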
What about the case of unequal uncertainties on the data points, i.e. heteroscedastic uncertainties? Then
Often, people qualitatively judge the quality of a fit by the reduced χ^{2}, or χ^{2} per degree of freedom:
It is important to recognize that this analysis depends on the assumption that the uncertainties are distributed according to a normal (Gaussian) distribution!
χ^{2} can also be used as a method for checking uncertainties, or even determining them (at the expense of being able to say whether your model is a good representation of the data!). See ML 4.1.
Back to the case of unequal uncertainties on the data points. For our simple model of measuring a single quantity, we have:
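Written out, the standard weighted-mean results are:

```latex
\chi^2 = \sum_i \frac{(y_i - \mu)^2}{\sigma_i^2}
\quad\Rightarrow\quad
\mu = \frac{\sum_i y_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}, \qquad
\sigma_\mu^2 = \frac{1}{\sum_i 1/\sigma_i^2}
```

Note that this reduces to the unweighted mean and σ/√N when all the σ_{i} are equal.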
Let's move on to a slightly more complicated example: fitting a straight line to a set of data. Here, the model is
Again, we want to minimize χ^{2}, so we take the derivatives with respect to each of the parameters and set them to zero:
Solving algebraically, you get:
We also want the uncertainties in the parameters. Again, use propagation of errors to get (work not shown!):
Finally, we will want to calculate the probability of getting χ^{2} for our fit, in an effort to understand 1) if our uncertainties are not properly calculated and normally distributed, or 2) if our model is a poor model.
Note that there can be covariances between the parameters! Consider the case where the values are far from the origin. The covariance is given by :
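For reference, in the standard (NR 15.2) notation, with S ≡ Σ 1/σ_i², S_x ≡ Σ x_i/σ_i², S_y ≡ Σ y_i/σ_i², S_xx ≡ Σ x_i²/σ_i², S_xy ≡ Σ x_i y_i/σ_i², and Δ ≡ S S_xx − S_x², the solutions and uncertainties for the model y = ax + b are:

```latex
a = \frac{S\,S_{xy} - S_x S_y}{\Delta}, \qquad
b = \frac{S_{xx} S_y - S_x S_{xy}}{\Delta}, \qquad
\sigma_a^2 = \frac{S}{\Delta}, \qquad
\sigma_b^2 = \frac{S_{xx}}{\Delta}, \qquad
\mathrm{Cov}(a,b) = -\frac{S_x}{\Delta}
```

When the x values all lie far from the origin, S_x is large, so the slope and intercept become strongly (anti)correlated.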
We can generalize the least squares idea to any model that is some linear combination of terms that are a function of the independent variable, e.g.
such a model is called a linear model, because it is linear in the parameters (but not at all necessarily linear in the independent variable!). The model could be a polynomial of arbitrary order, but could also include trigonometric functions, etc. We write the model in a simple form:
We can write the merit function as:
Separating the terms and interchanging the order of the sums, we get:
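Explicitly, for the linear model y(x) = Σ_k a_k X_k(x), the merit function and resulting normal equations take the standard form:

```latex
\chi^2 = \sum_{i=1}^{N} \frac{1}{\sigma_i^2}
         \left[ y_i - \sum_{k=1}^{M} a_k X_k(x_i) \right]^2,
\qquad
\frac{\partial \chi^2}{\partial a_k} = 0 \;\Rightarrow\;
\sum_{l=1}^{M} \alpha_{kl}\, a_l = \beta_k,

\alpha_{kl} = \sum_{i=1}^{N} \frac{X_k(x_i)\, X_l(x_i)}{\sigma_i^2}, \qquad
\beta_k = \sum_{i=1}^{N} \frac{y_i\, X_k(x_i)}{\sigma_i^2}
```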
The form of these equations can be simplified even further by casting them in terms of the design matrix, A, which consists of N measurements of M terms:
In fact, this can be even further generalized to allow for covariances, σ_{ij} in the data:
For a given data set, we can calculate α and β either by doing the sums or by setting up the design matrix and using matrix arithmetic. We then need to solve the set of equations for a_{k}.
This is just a linear algebra problem, and there are well-developed techniques (see Numerical Recipes chapter 2). Most simply, we just invert the matrix α to get
A simple algorithm for inverting a matrix is Gauss-Jordan elimination, but there are others. In particular, NR recommends singular value decomposition for all but the simplest least-squares problems; this is especially important if your problem is nearly singular, i.e. where two or more of the equations may not be totally independent of each other. Fully singular problems should be recognized and reformulated, but a non-singular problem can become nearly singular under some conditions, depending on how the data are sampled. See NR 15.4.2 and Chapter 2!
As an aside, note that it is possible to solve a linear set of equations significantly faster than inverting the matrix through various types of matrix decomposition, e.g., LU decomposition and Cholesky decomposition (again, see NR Chapter 2). There may be linear algebra problems that you encounter that only require the solution of the equations and not the inverse matrix. However, for the least squares problem, we often actually do want the inverse matrix C = α^{-1}, because its elements have meaning. Propagation of errors gives:
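As a minimal sketch of the matrix formulation (the model, data values, and noise level here are invented for illustration): build the design matrix, form α and β, and invert α to get both the parameters and their covariance matrix.

```python
import numpy as np

# Simulate data from a quadratic model y = a0 + a1*x + a2*x^2 (hypothetical values)
rng = np.random.default_rng(42)
x = np.linspace(-1, 1, 50)
sigma = 0.1 * np.ones_like(x)          # homoscedastic uncertainties
y_true = 1.0 + 2.0 * x - 0.5 * x**2
y = y_true + rng.normal(0.0, sigma)

# Design matrix A_ij = X_j(x_i) / sigma_i; here the X_j are monomials 1, x, x^2
A = np.vander(x, 3, increasing=True) / sigma[:, None]
b = y / sigma

alpha = A.T @ A                         # the normal-equations (curvature) matrix
beta = A.T @ b
C = np.linalg.inv(alpha)                # covariance matrix of the parameters
a = C @ beta                            # best-fit parameters
perr = np.sqrt(np.diag(C))              # parameter uncertainties
```

The diagonal of C gives the parameter variances, the off-diagonal elements the covariances, which is why the inverse itself, and not just the solution of the linear system, is of interest here.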
A Python implementation of matrix inversion, and of linear algebra in general, can be found in scipy.linalg; linear least squares is implemented in scipy.linalg.lstsq(), but this is really just an implementation of the matrix formulation.
For the specific linear case of a polynomial fit, you can use numpy.polyfit(). For the formulation of a design matrix for a polynomial fit, see numpy.vander().
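A short sketch of the polynomial case (the data here are simulated for illustration); note that polyfit can also return the covariance matrix of the coefficients:

```python
import numpy as np

# Simulated straight-line data (hypothetical slope/intercept/noise)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

# deg=1 fits a straight line; cov=True also returns the covariance matrix
coeffs, cov = np.polyfit(x, y, deg=1, cov=True)
slope, intercept = coeffs
slope_err, intercept_err = np.sqrt(np.diag(cov))
```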
Linear least squares is also implemented in astropy in the modeling module using the LinearLSQFitter class (which uses numpy.linalg); astropy has a number of standard models that you might use, or you can provide your own custom model.
from astropy.modeling import models, fitting
fit_p = fitting.LinearLSQFitter()
p_init = models.Polynomial1D(degree=degree)
pfit = fit_p(p_init, x, y)

Astropy has a number of common models, but you can also define your own model using models.custom_model, by supplying a function that returns values and a function that returns derivatives with respect to the parameters.
Let's do it. Simulate a data set, fit a line, and output parameters, parameter uncertainties, chi2, and probability of chi2. See leastsquares.ipynb
Not all linear problems are just functional fits with small numbers of parameters. A good example is spectral extraction for a multiobject fiber spectrograph. How is this problem formulated in the context of least squares? What does the α matrix look like? What needs to be solved? Note the spectro-perfectionism example!
Note that for problems with large numbers of parameters, the linear algebra can become very computationally expensive. In many cases, however, the matrix that represents these problems may only be sparsely populated, i.e. so-called sparse matrices (see Numerical Recipes 2.7, Figure 2.7.1).
In such cases, there are methods that allow the equations to be solved significantly more efficiently. See LAPACK, etc.
However, not all fits are linear in the parameters .... which of the following are not?
In a linear fit, the χ^{2} surface is parabolic in parameter space, with a single minimum, and linear least squares can be used to determine the location of the minimum. In the non-linear case, however, the χ^{2} surface can be considerably more complex, and there is a (good) possibility that there are multiple minima, especially when there are multiple parameters, so one needs to be concerned about finding the global minimum and not just a local minimum; see here for some examples. Because of the complexity of the surface, the best fit can't be found with a simple calculation as we could for the linear fit.
A conceptually simple approach would be a grid search, where one simply tries all combinations of parameters and finds the one with the lowest χ^{2}. This can be extremely computationally expensive, especially for problems with more than a few parameters! One is also forced to decide on a step size for the grid, although you might imagine successively refining the grid as you proceed.
Better approaches attempt to use the χ^{2} values to find the minimum more efficiently. They can generally be split into two classes: those that use the derivative of the function and those that don't. If you can calculate derivatives of χ^{2} with respect to your parameters, then this provides information about how far you can move in parameter space towards the minimum. If you can't calculate derivatives, you can evaluate χ^{2} at several different locations, and use these values to try to work your way towards the minimum.
Around the final minimum, the χ^{2} surface can be approximated as a parabola, and it is possible to correct the solution to the minimum solution if one can arrive at a set of parameters near the final minimum, where the parabolic approximation is good. Instead of solving for the parameters, one solves for corrections to a set of initial parameters, where the data vector is now the difference between the observed data and the model evaluated with the initial parameters. For details, see NR 15.5 or Robinson Chapter 6.
One arrives at a set of equations closely parallel to the equations for the linear case:
similar to what we had for the linear problem, but note that the data array ( δβ) is now the difference between the data and the current guess model, and we are calculating corrections to the current guess parameters rather than being able to solve directly for the parameters.
The matrix α_{kl} is known as the curvature matrix. In most implementations, the second derivative term is dropped: it is negligible near the minimum, since the sum over the y_{i} - y(x_{i}| a) terms tends to cancel; see NR 15.5.1 for a partial explanation. The standard approach then uses:
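In symbols, following NR 15.5, the quantities in these equations are:

```latex
\beta_k \equiv -\frac{1}{2}\frac{\partial \chi^2}{\partial a_k}
       = \sum_i \frac{y_i - y(x_i\,|\,a)}{\sigma_i^2}
         \frac{\partial y(x_i\,|\,a)}{\partial a_k},
\qquad
\alpha_{kl} \approx \sum_i \frac{1}{\sigma_i^2}
         \frac{\partial y(x_i\,|\,a)}{\partial a_k}
         \frac{\partial y(x_i\,|\,a)}{\partial a_l},
\qquad
\sum_l \alpha_{kl}\,\delta a_l = \beta_k
```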
To implement this, one chooses a starting guess for the parameters, solves for the corrections δa_{l} to be applied to the current set of parameters, and iterates until arriving at a solution that does not change significantly.
Far from the solution, this method can be significantly off in providing a good correction, and, in fact, can even move parameters away from the correct solution. In this case, it may be necessary to simply head in the steepest downhill direction in χ^{2} space, which is known as the method of steepest descent. Note that while this sounds like a reasonable thing to do in all cases, it can be very inefficient in finding the final solution (see, e.g. NR Figure 10.8.1).
A common algorithm switches between the method of steepest descent and the parabolic approximation and is known as the Levenberg-Marquardt method. This is done by using a modified curvature matrix:
where
When λ is large, this gives a steepest descent method, with larger λ corresponding to a smaller step; when λ is small, it gives the parabolic method. This leads to the following recipe:
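The standard modification and iteration loop (NR 15.5) are:

```latex
\alpha'_{jj} = \alpha_{jj}\,(1 + \lambda), \qquad
\alpha'_{jk} = \alpha_{jk} \quad (j \ne k)
```

Compute χ²(a) for a starting guess; pick a modest λ (e.g. 0.001); solve Σ α'_{kl} δa_{l} = β_{k}. If χ²(a + δa) ≥ χ²(a), increase λ by a factor of 10 and try again; if χ² decreased, decrease λ by a factor of 10, accept a + δa, and repeat. Stop when the decrease in χ² becomes negligible.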
This method, along with several others, is implemented in scipy.optimize.curve_fit(), where one provides input data and a model function.
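As a sketch, here is curve_fit applied to a Gaussian profile (data and parameter values simulated for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

# Model function: a Gaussian profile with amplitude, center, and width parameters
def gaussian(x, amp, center, width):
    return amp * np.exp(-0.5 * ((x - center) / width) ** 2)

# Simulated data (hypothetical true parameters: amp=2.0, center=0.5, width=1.2)
rng = np.random.default_rng(1)
x = np.linspace(-5, 5, 200)
y = gaussian(x, 2.0, 0.5, 1.2) + rng.normal(0, 0.05, x.size)

# p0 is the starting guess; sigma passes the measurement uncertainties
popt, pcov = curve_fit(gaussian, x, y, p0=[1.0, 0.0, 1.0],
                       sigma=0.05 * np.ones_like(x), absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))   # parameter uncertainties from the covariance matrix
```

Note that a reasonable starting guess p0 matters here, as for any nonlinear fit.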
The Levenberg-Marquardt method is also implemented in astropy in the modeling module using the LevMarLSQFitter class, which is used identically to the linear least squares fitter described above.
Two key issues in nonlinear least squares are finding a good starting guess and a good convergence criterion. You may want to consider multiple starting guesses to verify that you're not converging to a local minimum. For convergence, you can look at the amplitude of the change in parameters or the amplitude in the change of χ^{2}.
Example: fitting the width and/or center of a gaussian profile
Example: crowded field photometry: given an observation of a field with multiple stars that are sufficiently close to one another that light from different stars overlaps, how do we infer the brightnesses of individual stars? The basic idea is that we can measure the shape (PSF) from an isolated star and assume that all stars have the same shape; we will also assume that we can identify distinct stars, but not their exact positions. Given these assumptions, we can formulate this as a parameterized inference problem. The observed data are the distribution of intensities across all of the pixels. What is the model?
See leastsquares.ipynb.
For many problems, we may have data in more than one dimension, i.e. a multivariate problem, where we might want to look for relations between the different variables. In this case, rather than fitting a line to a paired set of points, we might want to fit a plane to a series of data points f (x, y), or even some higher number of dimensions.
Note that the general linear least squares formulation can be applied to problems with multiple independent variables, e.g., fitting a surface to a set of points; simply treat x as a vector of data, and the formulation is exactly the same.
See leastsquares.ipynb.
It is sometimes the case that one might want to look for a relationship between two variables, each of which has measurement uncertainty or intrinsic scatter. Ordinary least squares treats one variable as independent (and error-free) and one as dependent. Note Tully-Fisher example.
If you fit the model as x(y) rather than y(x) you do not in general get the same solution. There are several possible ways to deal with the situation. One is to do the forward and inverse fits and take the line that bisects them. Another is to fit for the minimum least-squares deviation perpendicular to the fitted relation (see Figure). See Feigelson and Babu 7.3.2. Note the discussion there about astronomers inventing a statistic ... which turns out to be OK in this case!
Alternatively, use a Bayesian approach ....
The matrix α that we constructed for the linear least squares problem is a special case of the Hessian matrix, which is defined as the second partial derivatives of the log-likelihood function with respect to the parameters of the model:
The inverse of the Hessian is the covariance, or error, matrix, which gives the uncertainties on the parameters: the diagonal elements are σ^{2}(a_{j}), and the off-diagonal elements are the covariances.
The Fisher information matrix, I(a), is related to the Hessian: the observed information is the negative of the Hessian evaluated at the maximum likelihood estimate, and the expected information is the expectation value of the negative of the Hessian, i.e., the Hessian averaged over many data sets. The latter is important because it sets the minimum variance that one can expect for derived parameters. This is known as the Cramer-Rao bound: the minimum variance is the inverse of the expected information.
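In symbols:

```latex
I_{jk}(a) = -\,E\!\left[ \frac{\partial^2 \ln L}{\partial a_j\, \partial a_k} \right],
\qquad
\mathrm{Var}(\hat{a}_j) \ \ge\ \left[ I(a)^{-1} \right]_{jj}
```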
The basic idea is straightforward: if the curvature is large, then the peak of the likelihood can be determined to higher accuracy than if the curvature is smaller.
The Cramer-Rao bound seems to be coming up with increasing frequency in discussions and in the literature, often in the context of whether people are extracting maximal information from existing data, e.g., abundances and RV examples.
Fitting provides us a set of best-fit parameters, but, because of uncertainties and limited number of data points, these will not necessarily be the true parameters. One generally wants to understand how different the derived parameters might be from the true ones.
Scatter in the data may arise from measurement uncertainties, but also from intrinsic scatter in the relationship that is being determined. In the presence of (unknown) intrinsic scatter, the rms around the fit will always be larger than that expected from measurement uncertainty alone.
The covariance matrix gives information about the uncertainties on the derived parameters in the case of well-understood uncertainties that are strictly distributed according to a Gaussian distribution. These can be used to provide confidence levels on your parameters; see NR 15.6.5.
In some cases, however, you may not be in that situation. Given this, it is common to derive parameter uncertainties by numerical techniques.
A straightforward technique if you have a good understanding of your measurement uncertainties is to use Monte Carlo simulation. In this case, you simulate your data set multiple times, derive parameters for each simulated data set, and look at the range of fit parameters as compared with the input parameters. To be completely representative of your uncertainties, you would need to draw the simulated data set from the true distribution, but of course you don't know what that is (it's what you're trying to derive)! So we take the best fit from the actual data as representative of the true distribution, and hope that the difference between the parameters derived from the simulated sets and the input parameters is representative of the difference between those from the actual data set and the true parameters. This procedure is known as the parametric bootstrap.
If you don't have a solid understanding of your data uncertainties, then Monte Carlo will not give an accurate representation of your parameter uncertainties. In this case, you can use multiple samples of your own data to get some estimate of the parameter uncertainties. A common technique is the non-parametric bootstrap technique, where, if you have N data points, you make multiple simulations using the same data, drawing N data points from the original data set – with replacement (i.e. same data point can be drawn more than once), derive parameters from multiple simulations, and look at the distribution of these parameters.
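A minimal sketch of the non-parametric bootstrap for a straight-line fit (the data here are simulated for illustration; in practice you would use your real data):

```python
import numpy as np

# Simulated data set (hypothetical slope=2, intercept=1, noise=0.2)
rng = np.random.default_rng(2)
N = 100
x = np.linspace(0, 1, N)
y = 2.0 * x + 1.0 + rng.normal(0, 0.2, N)

nboot = 500
slopes = np.empty(nboot)
for i in range(nboot):
    # Draw N points from the original data, with replacement
    idx = rng.integers(0, N, N)
    slopes[i], _ = np.polyfit(x[idx], y[idx], 1)

# Spread of the refit slopes estimates the slope uncertainty
slope_err = slopes.std()
```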
Another resampling technique is called jackknife resampling. Here, one removes a single data point from the data set each time, and recalculates the derived parameters, and looks at the distribution of these.
However, you may need to be careful about the interpretation of the confidence intervals determined by any of these techniques, which are based on a frequentist interpretation of the data. For these calculations, note that confidence intervals will change for different data sets. The frequentist interpretation is that the true value of the parameter will fall within the confidence levels at the frequency specified by your confidence interval. If you happen to have taken an unusual (infrequent) data set, the true parameters may not fall within the confidence levels derived from this data set!
There is some interesting discussion about the misapplication/misconception of the bootstrap method by astronomers in Loredo 2012 (which addresses several other misconceptions as well). The specific misconception here is ``variability = uncertainty"...
See Feigelson and Babu 3.6.2, 3.7.1, and 7.3.3 for additional discussion.
See the fitting_uncertainties notebook
Applicability of least-squares is limited to the case where the uncertainties are Gaussian. For other models, one wants to maximize the likelihood (or, usually, minimize -log(likelihood)). If the likelihood can be formulated analytically, this can be accomplished by numerical methods for minimizing a function with respect to its parameters.
For some problems, the model may be sufficiently complex that you cannot take the derivative of the model function with respect to the parameters. One example might be fitting model spectra to an observed spectrum. It is always possible to compute derivatives numerically and use these; however, this is not always the best solution, and one may want to consider methods of finding the minimum in parameter space that do not require derivatives.
A common minimization method that does not use derivatives is the ``downhill simplex" or Nelder-Mead algorithm. For this technique, the function is initially evaluated at M + 1 points (where M is the number of parameters). This makes an M-dimensional figure, called a simplex. The point with the largest value is found, and a trial evaluation is made at the point reflected through the opposite face of the simplex. If this is smaller than the next-highest value of the original simplex, a value twice the distance is chosen and tested; if not, then a value half the distance is tested, until a better value is found. This point replaces the initial point, and the simplex has moved a step; if a better value can't be found, the entire simplex is contracted around the best point, and the process tries again. The process is then repeated until a convergence criterion is met. See NR Figure 10.5.1 and section 10.5. The simplex works its way down the function surface, expanding when it can take larger steps and contracting when it needs to take smaller steps. Because of this behavior, it is also known as the ``amoeba" algorithm.
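A minimal sketch of Nelder-Mead via scipy (the function here is a toy paraboloid standing in for a χ² surface):

```python
import numpy as np
from scipy.optimize import minimize

# Toy chi^2-like surface with a single minimum at (3, -1) (hypothetical)
def chi2(params):
    a, b = params
    return (a - 3.0) ** 2 + 10.0 * (b + 1.0) ** 2

# Nelder-Mead uses only function values; no derivatives are required
result = minimize(chi2, x0=[0.0, 0.0], method='Nelder-Mead')
```

result.x holds the best-fit parameters and result.fun the minimum function value; convergence is controlled by options such as xatol and fatol.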
Python/scipy implementation of Nelder-Mead
https://scipy-lectures.org/advanced/mathematical_optimization/index.html
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
https://www.benfrederickson.com/numerical-optimization/
https://docs.mantidproject.org/v3.7.1/concepts/FittingMinimizers.html
Python/scipy summary of optimization routines in general
Note that, for methods that use derivatives, you can supply these analytically or you can have the routine calculate them numerically. The latter is generally computationally more expensive.
Note that I've just presented a few possibilities for minimizing non-linear functions. There are many more and a very large literature on the subject. If you find yourself having a non-trivial non-linear problem, you are highly advised to consider different possible approaches, as some may lead to significantly more efficient solutions than others!
See the minimization notebook
Another set of methods that does not involve derivatives is so-called genetic algorithms, see e.g. this article. The basic idea is to start with lots of different locations in parameter space, evaluate the likelihood, and then ``breed" the parameters of the best solutions with each other to find better solutions; see e.g, Charbonneau 1995, see also PIKAIA. The basic steps involve encoding the parameter information and then the operations of crossover and mutation; see here for some elaboration. These seem to be most useful for exploring parameter spaces that are very complex with multiple minima, and may provide an efficient way to find a global minimum. From what I read, it seems that they may not be particularly efficient in homing in on an accurate location for the global minimum once it is located.
https://towardsdatascience.com/genetic-algorithm-implementation-in-python-5ab67bb124a6
A variant, differential evolution: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html
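A minimal sketch of differential evolution via scipy (the objective here is a toy function with its minimum at (1, -2); only parameter bounds are required, not a starting guess or derivatives):

```python
import numpy as np
from scipy.optimize import differential_evolution

# Toy objective with minimum value 3.0 at (1, -2) (hypothetical)
def f(p):
    x, y = p
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + 3.0

# seed makes the stochastic search reproducible; by default the best
# candidate is polished with a local minimizer at the end
result = differential_evolution(f, bounds=[(-5, 5), (-5, 5)], seed=0)
```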