{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Multivariate Bayes and marginalization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Fitting mean and standard deviation of a set of measurements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "let's generate a synthetic data set of measurements of some quantity that has some mean and variance, that we will want to try to determine from the data set\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a random distribution from input parameters m0, s0\n", "m0,s0= # your choice of input mean and standard deviation here!\n", "npts= # if we have a large number of points, we'll have to be careful about calculating\n", " # total probabilities, because they will be a very small number, so we will work with log(likelihood)\n", "y=np.random.normal(m0,s0,size=npts)\n", "plt.plot(y,'ro')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to determine the mean and the standard deviation of the sample.\n", "\n", "First, for a quick review, let's use the standard frequentist statistical approach. Given a set of N measurements, what is the most likely value of the mean and the standard deviation? These are just the sample mean and sample standard deviation! Compute the values for your simulated data set:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# code to compute sample mean and standard deviation here\n", "sample_mean = # expression for mean\n", "sample_std = # expression for standard deviation\n", "print('sample mean: {:8.3f} sample standard deviation: {:8.3f}'.format(sample_mean,sample_std))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about the uncertainty on the sample mean? What is the standard result? \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sem= # expression for standard error of the mean\n", "print('uncertainty in mean: {:8.3f}'.format(sem))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now let's consider the Bayesian appproach to the problem. Here we want to determine the posterior probability distribution function for the mean and standard deviation of our data set.\n", "\n", "We will do so explicitly:\n", "$$P(\\mu,\\sigma | D) = {P(D|\\mu,\\sigma) P(\\mu,\\sigma) \\over P(D) }$$\n", "We are not able to calculate $P(D)$ so we will just do the numerator, but then require that it integrates to unity so that we have a proper posterior distribution function.\n", "\n", "First, write the routine for the likelihood of our data set, given arbitrary $\\mu$,$\\sigma$. The probability of a single data point is:\n", "\n", "$$P(x_i|\\mu,\\sigma) = {1\\over \\sigma\\sqrt{2\\pi}} \\exp {-0.5(x_i-\\mu)^2\\over \\sigma^2}$$\n", "\n", "The probability of the full data set is just the product of the individual probabilities.\n", "\n", "Write a routine to return the likelihood of the data set. You could use the [numpy.prod()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.prod.html) function to multiply the individual probabilities.\n", "\n", "However, note for large number of points (or even not so large!), the likelihood will get extremely small, and you may run into numerical underflow problems, so let's instead write a log(likelihood) routine. Here,\n", "$$log(L) = \\sum log(P(x_i|\\mu,\\sigma))$$\n", "One could just do the log of the Gaussians and sum them, but it's computationally inefficient, since you'll be doing exponetiation, then taking the log of it, so you can bypass that to get:\n", "$$log(L) = -N \\log \\sigma + \\sum{-0.5(x_i-\\mu)^2\\over\\sigma^2} $$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def gauss(x,m,s) :\n", " \"\"\" Gaussian function\"\"\"\n", " return #expression for a Gaussian\n", " \n", "# log-likelihood function for given data, mean, and standard deviation\n", "def log_L(y,m,s) :\n", " \"\"\"return the loglikelihood of the data set for given mean and standard deviation\n", " \"\"\"\n", " # logL is a sum instead of a product\n", " return # expression for sum of log(gaussians)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the posterior PDF, we need to multiply the likelihood by the prior. What is an appropriate prior $P(\\mu,\\sigma)$?\n", "$\\mu$ is a location parameter, so a flat prior is appropriate. $\\sigma$ is a scaling parameter, so a Jeffreys prior $P(\\sigma) \\propto {1\\over \\sigma}$ is appropriate, so\n", "$$P(m,\\sigma) = {1\\over \\sigma}$$\n", "
\n", "Write a routine for the prior and for the log(prior):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#routine to return prior\n", "def prior(m,s) :\n", " \"\"\" return prior for input mean and standard deviation \"\"\"\n", " return # expression for prior\n", "\n", "def log_prior(m,s) :\n", " \"\"\" return log(prior)\"\"\"\n", " return # expression for log(prior)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now calculate the posterior probability for a grid of mean and standard deviation. Again, since the likelihood may be small and underflow, we'll work with the log of the posterior. If the posterior is the product of the likelihood and prior, then the log of the posterior is:\n", "$$log(P(M|D)) = log(L) + log(prior)$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def log_posterior(y,m,s) :\n", " \"\"\" Calculate log(posterior) for given set of data and trial mean and standard deviation\n", " \"\"\"\n", " return # return expression for log(posterior) using your routines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now let's calculate the log(posterior) over a range of trial means and standard deviations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make a grid of L for various mean and sigma\n", "n=100 # we'll use 100 points in trial means and standard deviations (so, 10000 calculations of the posterior!)\n", "\n", "# set appropriate ranges for your trial mean and standard deviation and create a grid over which\n", "# you will compute the posterior\n", "trial_means=np.linspace(,,n) # set range of means to try from inspection of data (add start and end)\n", "trial_stds=np.linspace(,,n) # set range of standard deviations to try from inspection of data\n", "\n", "# set up a nxn grid to store the calculations\n", "log_Ppdf=np.zeros([n,n])\n", "\n", "# fill the grid\n", "for ix,m in enumerate(trial_means) :\n", " for iy,s in enumerate(trial_stds) :\n", " log_Ppdf[iy,ix] = # use your function for log(posterior) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normalize the posterior PDF so that it sums to unity, and display it. To do so, we need to go back to likelihood, not log(likelihood) but, to avoid tbhe very small number problem, we'll add a constant to log(L) so that this isn't an issue; this is not a problem because we're just going to normalize it anyway." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#if you are using a log(Ppdf), convert back to Ppdf, but after adding a constant to avoid underflow\n", "Ppdf = np.exp(log_Ppdf - log_Ppdf.max())\n", "\n", "# normalize the likelihood to have an integral of unity. We can just use a sum because we don't really care about\n", "# the bin size, which will just introduce a scaling constant, since we're going to normalize it\n", "Ppdf /= # normalize by dividing by sum\n", "\n", "# display the posterior\n", "plt.imshow(Ppdf)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now determine the marginal distributions for the mean, and determine the \"1-sigma\" credible region by looking at the cumulative distribution function of the marginal distribution. For the marginalization, note that numpy.sum() has an axis= keyword that allows you to sum over just that axis, i.e. turning a 2D array into a 1D array (as opposed to a single number without the axis= keyword).\n", "\n", "You may want to use the [numpy.cumsum()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cumsum.html) function for the credible region calculation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the marginal distribution function by marginalizing (summing) over the sigma axis\n", "mdf= # marginalize over the standard deviation\n", "\n", "# plot the marginal distribution\n", "plt.plot(mgrid,mdf)\n", "\n", "# get the cumulative distribution function of this to determined credible region\n", "cdf=mdf.cumsum()\n", "\n", "# determine the 1-sigma credible region, i.e. where cdf>0.16 and cdf<0.84\n", "m_16 = # mean value where cdf~0.16\n", "m_84 = # mean value where cdf~0.85\n", "\n", "#plot vertical lines for the credible region\n", "plt.plot([m_16,m_84],[0,0],color='r')\n", "\n", "# overplot the frequentist sample mean and its uncertainty for comparison\n", "plt.plot([sample_mean,sample_mean],plt.ylim(),color='g')\n", "plt.plot([sample_mean-sem,sample_mean-sem],plt.ylim(),color='g',ls=':')\n", "plt.plot([sample_mean+sem,sample_mean+sem],plt.ylim(),color='g',ls=':')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the frequentist perspective, the best estimate for the mean is the sample mean and the uncertainty of the mean is described by a Gaussian characterized by the standard error of the mean. How does the Bayesian posterior for the mean commpare with the frequentist perpsective?\n", "
\n", " ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now do the same marginal distribution for the standard deviation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the marginal distribution function by marginalizing (summing) over the neab axis\n", "mdf= # marginalize over mean axis\n", "plt.plot(sgrid,mdf)\n", "\n", "# get the cumulative distribution function of this to determined credible region\n", "cdf=mdf.cumsum()\n", "# determine the 1-sigma credible region, i.e. where cdf>0.16 and cdf<0.84\n", "s_16 = sgrid[np.argmin(np.abs(0.16-cdf))]\n", "s_84 = sgrid[np.argmin(np.abs(0.84-cdf))]\n", "#plot vertical lines for the credible region\n", "plt.plot([s_16,s_84],[0,0],color='r')\n", "\n", "# overplot the sample standard deviation\n", "plt.plot([sample_std,sample_std],plt.ylim(),color='g')\n", "\n", "# overplot the mean of the standard deviation posterior PDF\n", "mean=(sgrid*mdf).sum()\n", "plt.plot([mean,mean],plt.ylim(),color='r')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a simple routine for a 2D corner plot. You pass it the likelihood array and the vectors of trial values, along\n", "with axes labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# combine plots...\n", "def plot2(grid,xgrid,ygrid,vr=None,xt=None,yt=None) :\n", " \"\"\"function to plot likelihood and marginal distributions\n", " grid : input likelihood to display\n", " xgrid : range of x values\n", " ygrid : range of y values\n", " vr : range for scaling log(L) image\n", " xt, yt : titles for axes\n", " \"\"\"\n", " \n", " # make a 2x2 plot grid\n", " fig,ax=plt.subplots(2,2)\n", " if vr is None : \n", " # if no range is given, use min and max of data array\n", " vr = [grid.min(),grid.max()]\n", " # image of surface\n", " ax[0,0].imshow(grid,vmin=vr[0],vmax=vr[1],interpolation='nearest',origin='lower',\n", " aspect='auto',extent=(xgrid.min(),xgrid.max(),ygrid.min(),ygrid.max()))\n", " \n", " # marginal distribution summed over y axis\n", " ax[1,0].plot(xgrid,grid.sum(axis=0))\n", " ax[1,0].set_xlabel(xt)\n", " \n", " # marginal distribution summed over x axis, flipped\n", " ax[0,1].plot(grid.sum(axis=1),ygrid)\n", " ax[0,1].set_ylabel(yt)\n", " \n", " # get rid of unused plot\n", " ax[1,1].remove()\n", " fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the posterior PDF and the marginal distributions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot2(Ppdf,trial_means,trial_stds,xt='Mean',yt='$\\sigma$')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the frequentist perspective, the best estimate for the standard deviation is the sample standard deviation and the uncertainty of the standard is described by a Gaussian. How does the Bayesian posterior for the standard deviation commpare with the frequentist perpsective?\n", "\n", "
\n", " ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Fitting for parameters of a Gaussian emission line" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, how about fitting a Gaussian line profile with Poisson noise? Start with some simulated data, allowing for 4 parameters: scale, location, width, and background. We'll assume that we know the location." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def model(x,amp,loc,wid,back) :\n", " # 4-parameter model with scale/amplitude, center, width, and background\n", " return amp * np.exp(-(x-loc)**2/2/wid**2) + back\n", "\n", "# simulated data\n", "# x values to calculate measurements at\n", "xdata=np.arange(100)\n", "cent=50\n", "amp= # amplitude of line\n", "wid= # width (Gaussian sigma) of line\n", "back= # background level\n", "\n", "# Use Poisson draws from the model\n", "ydata=np.random.poisson(model(xdata,amp,cent,wid,back))\n", "plt.plot(xdata,ydata,'ro')\n", "plt.plot(xdata,model(xdata,amp,cent,wid,back))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now let's define the log likelihood function and a prior, and use them to populate a posterior probability distribution function grid covering three parameters: scale, width, and background. To visualize, marginalize over background and use plot routine from above to look at likelihood in scale and width and the marginal distributions in those quantities.\n", "\n", "For the likelihood, as we've done before: for a Poisson distribution, the probability of getting an observed value, y, given an underlying model value, f, is: \n", "$$P(x|f) = {\\exp(-f) f^x \\over x!}$$\n", "Take the log to get:\n", "$$ ln(P(x)) = -f(x) + x ln(f) - ln(x!)$$\n", "Multiplying the probabilities of each individual point is taking the sum of the logs. Since the last term is independent of the model parameters, we can neglect it since we will be normalizing the PDF later.\n", "$$ln(L) = \\sum -f_i + y ln(f_i)$$\n", "where $f_i$ is the model value at each $x_i$\n", "\n", "For the priors, the amplitude and width are scale parameters, while the background is a location parameter, so uniform priors in the latter, and Jeffreys (1/x) priors for the former" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "def log_L(xdata,ydata,amp,cent,wid,back) :\n", " \"\"\" calculate log(likelihood) for Poisson model given data and trial parameters\n", " \"\"\"\n", " return # expression for log(L)\n", "\n", "def log_prior(amp,cent,sig,back) :\n", " \"\"\" calculate log(prior): flat in mu, Jeffreys prior in amplitude and sigma\"\"\"\n", " return # expression for log(prior)\n", "\n", "def log_posterior(xdata,ydata,amp,cent,wid,back) :\n", " \"\"\" calculate log(posterior) using log(L) and log(prior)\n", " \"\"\"\n", " return # expresssion for log(posterior) using your functions\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now calculate a grid of log(posterior) at a range of trial amplitudes, widths, and backgrounds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate log(Ppdf) grid in amplitude (x), width(y), and background(z)\n", "n=50 # 50 points in each dimension --> 125000 calculations!\n", "\n", "log_Ppdf=np.zeros([n,n,n])\n", "trial_amps=np.linspace(,,n) # range of amplitudes to try (fill in start and end)\n", "trial_wids=np.linspace(,,n) # range of widths to try\n", "trial_backs=np.linspace(k,n) # range of backgrounds to try\n", "for ix,amp in enumerate(trial_amps) :\n", " for iy,wid in enumerate(trial_wids) :\n", " for iz,back in enumerate(trial_backs) :\n", " log_Ppdf[iz,iy,ix] = # use function for calculating posterior\n", " \n", "# convert to Ppdf from log(Ppdf), and normalize\n", "log_Ppdf-=log_Ppdf.max() # set peak to 0\n", "Ppdf=np.exp(log_Ppdf) # back to linear\n", "Ppdf/=np.sum(Ppdf) #normalize\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a routine for a 3D set of corner plots" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " def plot3(grid,xgrid,ygrid,zgrid,xt=None,yt=None,zt=None) :\n", " \"\"\" Plot projections of 3D PDF\n", " \"\"\"\n", " fig,ax=plt.subplots(3,3,figsize=(12,8))\n", " # marginalize over z (background)\n", " tmp=np.sum(grid,axis=0)\n", " ax[1,0].imshow(tmp,interpolation='nearest',origin='lower',aspect='auto',\n", " extent=[xgrid[0],xgrid[-1],ygrid[0],ygrid[-1]])\n", " ax[1,0].set_ylabel(yt)\n", " ax[0,0].plot(xgrid,np.sum(tmp,axis=0))\n", " ax[1,1].plot(ygrid,np.sum(tmp,axis=1))\n", " \n", " # marginalize over y (sigma)\n", " tmp=np.sum(grid,axis=1)\n", " ax[2,0].imshow(tmp,interpolation='nearest',origin='lower',aspect='auto',\n", " extent=[xgrid[0],xgrid[-1],zgrid[0],zgrid[-1]])\n", " ax[2,0].set_xlabel('scale')\n", " ax[2,0].set_ylabel('background')\n", " ax[2,2].plot(zgrid,np.sum(tmp,axis=1))\n", " ax[2,2].set_xlabel('background')\n", " \n", " # marginalize over x(scale)\n", " tmp=np.sum(grid,axis=2)\n", " ax[2,1].imshow(tmp,interpolation='nearest',origin='lower',aspect='auto',\n", " extent=[ygrid[0],ygrid[-1],zgrid[0],zgrid[-1]])\n", " ax[2,1].set_xlabel('sigma')\n", " ax[1,2].remove()\n", " ax[0,2].remove()\n", " ax[0,1].remove()\n", " fig.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# do the plot\n", "plot3(Ppdf,trial_amps,trial_wids,trial_backs,xt='scale',yt='wid',zt='back')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Discuss the results.\n", "
\n", " ANSWER HERE: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 1 }