{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to be able to experiment with expectation values for various estimators for different underlying populations.\n", "First, let's write a routine that returns the sample mean and variance for an input sample. Let's do this using numpy\n", "array arithmetic, but only using the sum() method. These should be calculated using: $\\bar x = {1\\over N} \\sum_i x_i$ and $\\sigma^2 = {1\\over N-1}\\sum_i (x_i-\\bar x)^2$. You have to fill in the missing code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def stats(sample) :\n", " \"\"\" Return the sample mean and variance of an input sample\"\"\"\n", " # calculation of mean and variance here\n", " n=len(sample) \n", " mean= #insert code here\n", " variance= #insert code here\n", " \n", " return mean, variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now write a routine that will generate a sample of input size, with an option for using either a uniform or a Gaussian (normal) distribution. Here we will consider a uniform distribution with values distributed between 0 and 1, and a normal distribution with zero mean and unit standard deviation. 
Samples for these distributions can be generated using the [numpy.random.uniform()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.uniform.html) function or the [numpy.random.normal()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.normal.html) function; you have to fill in the missing function calls:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def getsample(size,uniform=True) :\n", " \"\"\"Generate a sample of input size from a uniform distribution (0-1) \n", " if uniform is True, else a normal distribution with zero mean\n", " and unit standard deviation\n", " \"\"\"\n", " if uniform :\n", " # np.random.uniform call here\n", " \n", " else :\n", " # np.random.normal call here\n", " \n", " \n", " return sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do a quick test of your routines. Generate a sample using the getsample() routine (test both uniform and normal), use stats() to get the mean and variance. Check the results using the [numpy.mean()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and [numpy.var()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html) methods on the samples. Check the samples by looking at a histogram. Do this for both a uniform and a normal distribution. 
You have to choose the sample size and the distribution type, and set bins for the histogram accordingly:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n= # choose a sample size here\n", "uniform= # boolean to set type of distribution here\n", "sample=getsample(n,uniform=uniform)\n", "xmin=\n", "xmax=\n", "delta=\n", "plt.hist(sample,bins=np.arange(xmin,xmax,delta)) # set appropriate bins here using xmin,xmax,delta\n", "mean,variance=stats(sample)\n", "print('calculated mean: {:7.2f} variance: {:7.4f}'.format(mean,variance))\n", "print('numpy mean: {:7.2f} variance: {:7.4f}'.format(sample.mean(),sample.var()))\n", "if not np.isclose(sample.mean(),mean) : print('PROBLEM WITH MEAN!')\n", "if not np.isclose(sample.var(),variance) : print('PROBLEM WITH VARIANCE? How does numpy compute variance by default?')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Experiment with running this multiple times, with different sample sizes, say from 10 points per sample to 1000 points per sample. \n", "\n", "What did you learn about the numpy variance method (which is also true for the standard deviation method)? What do you think about your distributions and your estimators? How do they change with the size of the sample?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANSWER HERE:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the true values of the mean and variance for a uniform distribution? Calculate them analytically (show your work). Do you get these values exactly with your samples? Why or why not?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANSWER HERE:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now let's generate some expectation values. 
To do this, let's write a routine to generate a large number of samples (nsamp), each of size n, calculate the sample mean and variance for each, then average these together to get the expectation value. So, the expectation value of the mean is the \"mean mean\", and the expectation value of the variance is the \"mean variance\". Let's also calculate the \"mean standard deviation\". You have to fill in the expressions to calculate the expectation values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def expectations(nsamp,n,uniform=True) :\n", " \"\"\" Calculate expectation values by drawing nsamp samples, each one with n members\n", " and calculating expectation values by averaging the statistics from each sample\n", " \"\"\"\n", " all_means=[]\n", " all_variances=[]\n", " all_std=[]\n", " for i in range(nsamp) :\n", " sample=getsample(n,uniform=uniform)\n", " mean,variance=stats(sample)\n", " all_means.append(mean)\n", " all_variances.append(variance)\n", " all_std.append(np.sqrt(variance))\n", " all_means=np.array(all_means)\n", " all_variances=np.array(all_variances)\n", " all_std=np.array(all_std)\n", " \n", " expectation_mean= #add expressions\n", " expectation_variance=\n", " expectation_std=\n", " return expectation_mean,expectation_variance,expectation_std" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do some tests of your routine:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nsamp=\n", "n=\n", "m,v,s=expectations(nsamp,n)\n", "print(m,v,s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try running this for a large number of samples, for a range of different sample sizes. Make plots of these expectation values as a function of sample size. Make sure you understand the code here." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nsamp=10000\n", "sizes=[2,4,6,8,10,100,1000]\n", "all_expectation_mean=[]\n", "all_expectation_variance=[]\n", "all_expectation_std=[]\n", "for n in sizes :\n", " m,v,s=expectations(nsamp,n)\n", " all_expectation_mean.append(m)\n", " all_expectation_variance.append(v)\n", " all_expectation_std.append(s)\n", " \n", "plt.plot(sizes,all_expectation_mean,'ro')\n", "plt.xscale('log')\n", "plt.figure()\n", "plt.plot(sizes,all_expectation_variance,'ro')\n", "plt.xscale('log')\n", "plt.figure()\n", "plt.plot(sizes,all_expectation_std,'ro')\n", "plt.xscale('log')\n", "print(m,s)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are the expectation values of the mean, variance, and standard deviation biased or unbiased? Are they consistent (converge to correct value as n increases)?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANSWER HERE:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now let's consider the variance of the estimators. What do you expect (quantitatively!) for the standard deviation of the mean (standard error of the mean) given the analytic result discussed in class and in readings?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ANSWER HERE:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's calculate a bunch of means, and look at their spread and variance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def getmeans(nsamp=1000,n=100,uniform=True) :\n", " \"\"\"return an array of sample means for input number of samples and sample sizes\n", " \"\"\"\n", " all_means=[]\n", " for i in range(nsamp) :\n", " sample=getsample(n,uniform=uniform)\n", " mean,variance=stats(sample)\n", " all_means.append(mean)\n", " return np.array(all_means)\n", "\n", "# experiment with different values of nsamp, n, uniform\n", "all_means= getmeans(nsamp=1000,n=100,uniform=True)\n", "plt.hist(all_means)\n", "print(np.array(all_means).std())\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does the variance in the mean depend on the sample size? Calculate the standard deviation for a number of sample sizes, \n", "and plot as a function of sample size to check the behavior:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# add code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does the behavior of the variance in the mean agree with your expectation?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if you calculated the median instead of the mean? What would you expect (quantitatively!) for the standard deviation of the median as compared with that of the mean? As a function of sample size?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANSWER HERE:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now do the same experiment for the median of each sample. Write a getmedians() routine like getmeans() above. You can use np.median(sample) instead of your stats routine to calculate the median of an individual sample. 
Make the plot of variances as a function of sample size." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# add code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Did it come out as expected?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANSWER HERE :" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's consider other, more robust, estimators for the variance or standard deviation. We talked about two such estimators, the interquartile range (IQR), and the mean absolute deviation (MAD). For a normal distribution, $\\sigma$ = 0.7413 * IQR, and $\\sigma$ = 1.253 * MAD. Write routines to calculate these (note the scipy.stats.iqr() routine; for the mean absolute deviation, you should be able to calculate it in a single line with numpy routines). Determine if these estimators are unbiased and consistent. Consider the variance of the estimators; which do you prefer?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import scipy.stats\n", "def getiqr(nsamp=1000,n=100,uniform=True) :\n", " \"\"\"return an array of sample IQR for input number of samples and sample sizes\n", " \"\"\"\n", " # add code here\n", " \n", " \n", "def getmad(nsamp=1000,n=100,uniform=True) :\n", " \"\"\"return an array of sample MAD for input number of samples and sample sizes\n", " \"\"\"\n", " # add code here\n", " \n", "# plot expectation value of the estimator as a function of sample size" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot variance of the estimator as a function of sample size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Discuss your findings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ANSWER HERE :" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": 
"Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }