{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import scipy.stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's write a routine to compute the covariance between two variables. Remember, this is caluclated using: $\\sigma_{xy} = {\\sum (x_i-\\bar x)(y_i- \\bar y) \\over n-1}$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cov(x,y) :\n", " \"\"\"Compute the covariance between two sets of values\"\"\"\n", " n=len(x)\n", " cov= # code here\n", " return cov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you were to make a 2D probability distribution function with independent variables, using a uniform distribution for each variable, what would you expect to get for the covariance? How about the full covariance matrix? What would you expect for a normal distribution?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANSWER HERE:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's generate a 2D probabilty distribution function with uncorrelated variables, and use your routine to caculate the covariance. Let's compare it to the [numpy.cov()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html) routine, which calculates the covariance matrix. How do the values compare with your expectation? How does the numpy routine calculate covariance by default (as compared to how it calculates variance)?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n=100\n", "# Generate a set of n randomly distributed points (uniform or gaussian)\n", "x= # code here\n", "# generate an independent set of randomly distributed points\n", "y= # code here\n", "# plot them in a square plot\n", "plt.figure()\n", "axes=plt.gca()\n", "axes.set_aspect('equal')\n", "plt.plot(x,y,'ro')\n", "# calculate and print the covariance. Compare with numpy.cov\n", "c=cov(x,y)\n", "print('covariance: ',c)\n", "print('covariance matrix:')\n", "c_matrix=np.cov(x,y)\n", "print(c_matrix)\n", "if not np.isclose(c,c_matrix[0,1]) : print('Problem with covariance calculation?')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's generate a perfectly correlated (or anticorrelated) data set. What do you predict for the covariance matrix in this case, again, both for a uniform distribution and a normal distribution in the variables?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ANSWER HERE: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n=10000\n", "# generate a randomly distributed data set in one variable\n", "x= # code here\n", "# set the other variable to be THE SAME (i.e., not an independent random sample), or THE SAME with opposite sign\n", "y= # code here\n", "plt.figure()\n", "axes=plt.gca()\n", "axes.set_aspect('equal')\n", "\n", "plt.plot(x,y,'ro')\n", "# calculate and print the covariance matrix\n", "c=cov(x,y)\n", "print('covariance: ',c)\n", "c_matrix=np.cov(x,y)\n", "print('covariance matrix:')\n", "print(c_matrix)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's generate a correlated data set, but not with a perfect correlation. Do this by injecting some random noise on top of the underlying correlation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_correlated_sample(n,uniform=True,xmin=0,xmax=1,mean=0,sigma=1) :\n", " \"\"\"Generated correlated sample data set\"\"\"\n", " if uniform :\n", " x=np.random.uniform(xmin,xmax,size=n)\n", " else :\n", " x=np.random.normal(mean,sigma,size=n)\n", " # make the other variable correlated, but add some random noise to one of the variables\n", " y= # code here\n", " return x,y\n", "\n", "n=100\n", "x,y=get_correlated_sample(n)\n", "plt.figure()\n", "axes=plt.gca()\n", "axes.set_aspect('equal')\n", "\n", "plt.plot(x,y,'ro')\n", "\n", "# calculate and print the covariance matrix\n", "c=cov(x,y)\n", "print('covariance: ',c)\n", "c_matrix=np.cov(x,y)\n", "print('covariance matrix:')\n", "print(c_matrix)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do you expect your results for the covariance matrices change if you changed the width of the underlying distributions, both for uniform and normally distributed variables?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANSWER HERE:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Play around with your data sets, changing the widths and types (uniform and normal) of the underlying distributions to build up some intuition about the covariance matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now let's work the other direction, i.e., given a covariance matrix, generate a multivariate distribution. For normal distributions, we can use the [numpy.random.multivariate_normal()](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.multivariate_normal.html) routine. Play around with this for different values of the input means and covariance matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n= # set value\n", "\n", "# The desired mean values of the sample.\n", "xmean= # set value\n", "ymean= # set value\n", "mu = np.array([xmean,ymean])\n", "\n", "# The desired covariance matrix.\n", "c00=\n", "c01=\n", "c10=\n", "c11=\n", "r = np.array([\n", " [ c00, c01 ],\n", " [ c10, c11]\n", " ])\n", "\n", "# Generate the random samples.\n", "yvar = np.random.multivariate_normal(mu, r, size=n)\n", "plt.figure()\n", "axes=plt.gca()\n", "axes.set_aspect('equal')\n", "\n", "plt.plot(yvar[:,0],yvar[:,1],'ro')\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "OK, now let's work with quantifying correlation for distributions with arbitrary width, using the correlation coefficient, which is essentially a normalized covariance.\n", "\n", "[scipy.stats.pearsonr](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html) computes the (Pearson) correlation coeffient, and returns it plus the probability of getting the value given the number of points, given the null hypothesis. [scipy.stats.spearmanr](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html) does the same thing for Spearman's correlation coefficient. Let's write a quick routine that calculates both for an input data set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def corrcoef(x,y) :\n", " \"\"\" Compute Pearsons and Spearmans correlation coefficients and probabilities\"\"\"\n", " pearson_r, pearson_p = # code here\n", " spearman_r, spearman_p = # code here\n", " print('Pearsons: ', pearson_r, pearson_p)\n", " print('Spearmans: ',spearman_r, spearman_p)\n", " \n", "\n", "corrcoef(x,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Experiment with generating sets of data with different covariance matrices and measuring the correlation coefficients" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we want to make a nonlinear relation? One way to accomplish this is to generate a random sample from a distribution of each x point, with some nonlinear relation. Play around with the relation and the scatter around it, and see the effect on the measured correlation coefficients." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }