{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model comparison"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"from scipy.stats import chi2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by simulating a data set with some polynomial plus noise. We will use numpy.polyval() to generate the polynomial: you get to choose what order you want by setting the number of input parameters (choose something between 3 and 6)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# simulate a data set with a polynomial and Gaussian uncertainties\n",
"npts=20\n",
"x=np.linspace(0,1,npts)\n",
"\n",
"# choose parameters for a polynomial here. Number of parameters you set will determine order of polynomial\n",
"truepars= # your choice of parameters and polynomial order there\n",
"sig= # set uncertainties; make sure they are noticeably large\n",
"\n",
"y=np.polyval(truepars,x)+np.random.normal(0,sig,npts) # one noise draw per data point\n",
"\n",
"plt.errorbar(x,y,sig,fmt='o') # plot it\n",
"plt.plot(x,np.polyval(truepars,x))"
]
},
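{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, here is one possible set of choices (a third-order polynomial with constant uncertainties; the specific values of `truepars` and `sig` are illustrative assumptions, not required ones):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# example sketch: one possible choice of polynomial and uncertainties\n",
"truepars=[2.,-3.,1.,0.5]   # illustrative third-order polynomial\n",
"sig=0.3*np.ones(npts)      # noticeably large uncertainties\n",
"\n",
"# one independent noise draw per data point\n",
"y=np.polyval(truepars,x)+np.random.normal(0,sig)\n",
"\n",
"plt.errorbar(x,y,sig,fmt='o')\n",
"plt.plot(x,np.polyval(truepars,x))"
]
},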
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### $\\chi^2$ as a function of number of parameters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's imagine that we have this data set, but don't know a priori what order of fit to use. We want to answer the question: what is the most justified order of polynomial to use for the fit? \n",
"\n",
"The problem is that it is hard to tell whether additional parameters are warranted: are they fitting real structure in the data, or are they fitting noise?\n",
"\n",
"First, let's demonstrate how $\\chi^2$ varies with the number of parameters. Remember,\n",
"$$\\chi^2 = \\sum_i {(y_i - f(x_i|p))^2\\over \\sigma_i^2}$$\n",
"where $f$ is the model function, in this case numpy.polyval().\n",
"\n",
"We'll do the fits using our standard linear algebra:\n",
"$$par = (A^T\\Sigma^{-1} A)^{-1} A^T\\Sigma^{-1} y$$\n",
"where $A$ is the design matrix (the derivatives of the model with respect to each parameter, evaluated at each value of the independent variable), $\\Sigma$ is the covariance matrix, and $y$ is the dependent variable array."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# set up an empty list to accumulate our X2 values\n",
"X2_vals=[]\n",
"\n",
"# loop over different numbers of parameters, from 1 to 7 (i.e., constant through sixth-order polynomial)\n",
"npars=range(1,8)\n",
"for npar in npars : \n",
" # do the least squares fit\n",
" # design matrix for a polynomial is a Vandermonde matrix (np.vander)\n",
" design=\n",
" # covariance matrix (no off-diagonal covariances in the data, so just diagonal; note np.diag)\n",
" C=\n",
" # inverse of covariance matrix\n",
" Cinv=\n",
" # get left hand side A^T Sigma^-1 A\n",
" ATA=np.dot(np.dot(design.T,Cinv),design)\n",
" # solve for the parameters with rhs = AT Sigma-1 y\n",
" par=np.linalg.solve(ATA,np.dot(np.dot(design.T,Cinv),y))\n",
" \n",
" #plot the fit\n",
" plt.errorbar(x,y,sig,fmt='ro')\n",
" xmod=np.arange(0,1,0.01)\n",
" plt.plot(xmod,np.polyval(par,xmod))\n",
" \n",
" # calculate X2\n",
" X2= # fill in the X**2 expression\n",
" X2_vals.append(X2)\n",
"\n",
"# plot the chi^2 values as a function of number of parameters\n",
"plt.figure()\n",
"plt.plot(npars,X2_vals,'ro')\n",
"plt.xlabel('npar')\n",
"plt.ylabel(r'$\\chi^2$')\n",
"\n"
]
},
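{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, here is one way to fill in the blanks of the loop above (assuming `x`, `y`, and `sig` from the simulated data set): a Vandermonde design matrix, a diagonal covariance matrix, and the usual $\\chi^2$ sum."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: one way to fill in the blanks in the loop above\n",
"X2_vals=[]\n",
"npars=range(1,8)\n",
"for npar in npars :\n",
"    design=np.vander(x,npar)                 # Vandermonde design matrix (highest power first, matching polyval)\n",
"    C=np.diag(sig**2)                        # diagonal covariance matrix\n",
"    Cinv=np.linalg.inv(C)\n",
"    ATA=np.dot(np.dot(design.T,Cinv),design)\n",
"    par=np.linalg.solve(ATA,np.dot(np.dot(design.T,Cinv),y))\n",
"    X2=np.sum((y-np.polyval(par,x))**2/sig**2)\n",
"    X2_vals.append(X2)\n",
"\n",
"plt.plot(npars,X2_vals,'ro')\n",
"plt.xlabel('npar')\n",
"plt.ylabel(r'$\\chi^2$')"
]
},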
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How does $\\chi^2$ change with the number of parameters? Can you use this to determine the best order of the fit? Why or why not?\n",
"\n",
"**ANSWER HERE:**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### AIC and BIC"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to try to determine the most appropriate model is to use the Akaike Information Criterion:\n",
"$$AIC = -2 \\ln(L_{max}) + 2 J$$\n",
"where $L_{max}$ is the maximum likelihood value, and $J$ is the number of parameters.\n",
"\n",
"Another is the Bayesian Information Criterion:\n",
"$$BIC = -2 \\ln(L_{max}) + J \\ln(N)$$\n",
"where $N$ is the number of data points.\n",
"\n",
"Supply routines to calculate ln(L), AIC, and BIC, for polynomial fits. Remember, for normally distributed uncertainties:\n",
"$$L = \\prod_i {1\\over \\sqrt{2\\pi}\\sigma_i}\\exp\\left[{-0.5 (y_i - f(x_i|p))^2\\over \\sigma_i^2}\\right]$$\n",
"$$\\ln(L) = \\sum_i {-0.5 (y_i - f(x_i|p))^2\\over \\sigma_i^2} + const$$\n",
"Since we are looking to determine the minimum of AIC/BIC, we can ignore the constant.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def logL(par,x,y,sig) :\n",
" \"\"\" log(likelihood function)\n",
" par are model parameters\n",
" x,y,sig are observed data points and (Gaussian) uncertainties\n",
" \"\"\"\n",
" return # fill in expression for log(likelihood)\n",
"\n",
"def aic(par,x,y,sig) :\n",
" \"\"\" Compute Akaike Information Criterion (AIC)\n",
" \"\"\"\n",
" J= # number of parameters\n",
" return # fill in expression for AIC\n",
"\n",
"def bic(par,x,y,sig) :\n",
" \"\"\" Compute Bayesian Information Criterion (BIC)\n",
" \"\"\"\n",
" J= # number of parameters\n",
" N= # number of data points\n",
" return # fill in expression for BIC\n"
]
},
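{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, possible implementations of these routines (dropping the additive constant in $\\ln(L)$, since only differences in AIC/BIC matter):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch implementations; the constant in log(L) is dropped\n",
"def logL(par,x,y,sig) :\n",
"    \"\"\" log(likelihood), up to an additive constant \"\"\"\n",
"    return np.sum(-0.5*(y-np.polyval(par,x))**2/sig**2)\n",
"\n",
"def aic(par,x,y,sig) :\n",
"    \"\"\" Akaike Information Criterion \"\"\"\n",
"    J=len(par)\n",
"    return -2*logL(par,x,y,sig)+2*J\n",
"\n",
"def bic(par,x,y,sig) :\n",
"    \"\"\" Bayesian Information Criterion \"\"\"\n",
"    J=len(par)\n",
"    N=len(y)\n",
"    return -2*logL(par,x,y,sig)+J*np.log(N)"
]
},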
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will do fits of a range of polynomial orders, get the best fits, and compute AIC and BIC, then plot these against polynomial fit order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# initialize accumulators\n",
"aic_vals=[]\n",
"bic_vals=[]\n",
"chi2_vals=[]\n",
"chi2prob_vals=[]\n",
"\n",
"# loop over polynomial fit orders\n",
"npars=range(1,8)\n",
"for npar in npars :\n",
" \n",
" # do the least squares fit exactly as above (retyping it may help it sink in!)\n",
" design=\n",
" C=\n",
" Cinv=\n",
" # solve\n",
" ATA=\n",
" par=\n",
" \n",
" #plot the fit\n",
" plt.errorbar(x,y,sig,fmt='ro')\n",
" xmod=np.arange(0,1,0.01)\n",
" plt.plot(xmod,np.polyval(par,xmod))\n",
" \n",
" #calculate AIC and BIC\n",
" aic_vals.append(aic(par,x,y,sig))\n",
" bic_vals.append(bic(par,x,y,sig))\n",
" #plt.ylim(0,20)\n",
"\n",
"\n",
"print(aic_vals)\n",
"plt.figure()\n",
"plt.plot(npars,aic_vals,'ro',label='AIC')\n",
"plt.plot(npars,bic_vals,'bo',label='BIC')\n",
"plt.legend()\n",
"plt.ylim(0,50)\n",
"\n"
]
},
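{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, the blanks in the loop above can be filled exactly as before (these lines assume the loop variable `npar`, plus `x`, `y`, and `sig`, so they are meant to be plugged into the loop rather than run standalone):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: lines to plug into the loop above (not a standalone cell)\n",
"design=np.vander(x,npar)\n",
"C=np.diag(sig**2)\n",
"Cinv=np.linalg.inv(C)\n",
"ATA=np.dot(np.dot(design.T,Cinv),design)\n",
"par=np.linalg.solve(ATA,np.dot(np.dot(design.T,Cinv),y))"
]
},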
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on AIC and BIC, which number of parameters would you choose? Does this match the number of parameters in the model that you used to generate the data set?\n",
"\n",
"**ANSWER HERE:**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cross validation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Another way to approach the problem is to consider whether a different data set, drawn from the same distribution, gives the same qualitative fit: if so, then perhaps the model is warranted, if not, then noise may be having a strong effect (overfitting). This is the basic idea behind cross-validation: develop a model from a subset of the data and judge it by how it performs against a different subset.\n",
"\n",
"Let's generate a different data set, and then split it into a training set and a cross-validation set. To generate the random sample for the training set, we can use numpy.random.shuffle(). Since we want to keep the (x,y) pairs together, we'll shuffle an array of indices.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"npts=30\n",
"\n",
"# simulate another data set\n",
"xall=np.linspace(0,1,npts)\n",
"truepars= # set list/array with model parameters\n",
"sigall= # set uncertainties so they are noticeable\n",
"yall=np.polyval(truepars,xall)+np.random.normal(0,sigall,npts)\n",
"\n",
"# to choose a random sample, shuffle the indices with numpy.random.shuffle()\n",
"indices=np.arange(npts)\n",
"np.random.shuffle(indices)\n",
"\n",
"# create training set. Here we'll use 70% of the data for the training set, and the remaining 30% for cross-validation\n",
"ntrain=int(0.7*npts)\n",
"# use indices[0:ntrain] of xall and yall\n",
"xtrain=\n",
"ytrain=\n",
"sigtrain=\n",
"\n",
"# create cross validation set\n",
"# use indices[ntrain:] (the remainder) for cross validation set\n",
"xcv=\n",
"ycv=\n",
"sigcv=\n",
"\n",
"# plot training and cross-validation sets\n",
"plt.plot(xtrain,ytrain,'ro')\n",
"xmod=np.arange(0,1,0.01)\n",
"plt.plot(xmod,np.polyval(truepars,xmod))\n",
"plt.plot(xcv,ycv,'go')"
]
},
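{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, the split can be done with fancy indexing (assuming `sigall` was set as an array, one uncertainty per point):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: fancy indexing with the shuffled indices\n",
"xtrain=xall[indices[0:ntrain]]\n",
"ytrain=yall[indices[0:ntrain]]\n",
"sigtrain=sigall[indices[0:ntrain]]\n",
"\n",
"xcv=xall[indices[ntrain:]]\n",
"ycv=yall[indices[ntrain:]]\n",
"sigcv=sigall[indices[ntrain:]]"
]
},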
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will again do fits over a range of polynomial orders, but using only the training set; we will then evaluate $\\chi^2$ for both the training set and the cross-validation set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#initialize lists to save the X2 values\n",
"X2_vals=[]\n",
"X2_cv_vals=[]\n",
"\n",
"# loop over different number of parameters in the fit, using the training set only to do the fit\n",
"for npar in npars :\n",
" design=np.vander(xtrain,npar,increasing=False)\n",
" C=np.diag(sigtrain**2)\n",
" Cinv=np.linalg.inv(C)\n",
"\n",
" # solve\n",
" ATA=np.dot(np.dot(design.T,Cinv),design)\n",
" par=np.linalg.solve(ATA,np.dot(np.dot(design.T,Cinv),ytrain))\n",
"\n",
" # calculate X2 for the training set\n",
" X2=\n",
" X2_vals.append(X2)\n",
" \n",
" # calculate X2 for the cross validation set\n",
" X2_cv=\n",
" X2_cv_vals.append(X2_cv)\n",
"\n",
"# plot the chi^2 values for both sets as a function of the number of parameters\n",
"plt.plot(npars,X2_vals,label='training')\n",
"plt.plot(npars,X2_cv_vals,label='cross-validation')\n",
"# may need to adjust limits to see the minima well\n",
"plt.ylim(0,50)\n",
"plt.legend()"
]
},
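{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, the two $\\chi^2$ expressions have the same form, evaluated with the training-set fit on each subset (these lines are meant to be plugged into the loop above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: chi^2 of the training-set fit, evaluated on each subset\n",
"X2=np.sum((ytrain-np.polyval(par,xtrain))**2/sigtrain**2)\n",
"X2_cv=np.sum((ycv-np.polyval(par,xcv))**2/sigcv**2)"
]
},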
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do the training and cross-validation values of $\\chi^2$ change with the number of parameters? How does this indicate an appropriate number of parameters to use for the fit?\n",
"\n",
"**ANSWER HERE:**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regularization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yet another approach is to use regularization in a fit with more parameters/flexibility.\n",
"\n",
"For ridge regression, it turns out that the regularized solution just requires a small modification to the least squares solution:\n",
"\n",
"$$ pars = (A^T C^{-1} A + \\lambda I)^{-1} A^T C^{-1} y$$\n",
"where $I$ is the identity matrix.\n",
"\n",
"But first we need to standardize the independent variable so that it is centered on zero and has values of order unity, i.e., transform to:\n",
"$$x' = {(x-\\bar{x})\\over\\sigma_x}$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"npts=30\n",
"xx=(xall-xall.mean())/xall.std() # standardized independent variable\n",
"\n",
"# plot data vs normalized independent variable\n",
"plt.plot(xx,yall,'ro')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will try a range of regularized fits with different regularization parameters, using a high order fit. We will print out the values of the parameters. Here, we will use our full data set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# order of model to fit\n",
"npar=8\n",
"\n",
"# loop over regularization parameter values\n",
"for lam in np.arange(0.,1,0.1):\n",
" \n",
" design=np.vander(xx,npar,increasing=False)\n",
" C=np.diag(sigall**2)\n",
" Cinv=np.linalg.inv(C)\n",
" # solve\n",
" ATA=np.dot(np.dot(design.T,Cinv),design)+np.diag(lam*np.ones(npar)) # this is where the regularization comes in\n",
" pars=np.linalg.solve(ATA,np.dot(np.dot(design.T,Cinv),yall))\n",
"\n",
" # set number of digits to print\n",
" np.set_printoptions(precision=4,suppress=True)\n",
" print(pars)\n",
" \n",
" # plot the fit\n",
" plt.plot(xx,yall,'ro')\n",
" xmod=np.arange(0,1,0.01)\n",
" xxmod=(xmod-xall.mean())/xall.std() # use the same standardization as for xx\n",
" plt.plot(xxmod,np.polyval(pars,xxmod))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How does increasing the value of the regularization parameter $\\lambda$ affect the values of the parameters? How does it affect the shape of the fit?\n",
"\n",
"**ANSWER HERE:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 4
}