{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import scipy.stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's write a routine to compute the covariance between two variables. Remember, this is caluclated using: $\\sigma_{xy} = {\\sum (x_i-\\bar x)(y_i- \\bar y) \\over n-1}$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def cov(x,y) :\n",
" \"\"\"Compute the covariance between two sets of values\"\"\"\n",
" n=len(x)\n",
" cov= # code here\n",
" return cov"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you were to make a 2D probability distribution function with independent variables, using a uniform distribution for each variable, what would you expect to get for the covariance? How about the full covariance matrix? What would you expect for a normal distribution?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ANSWER HERE:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's generate a 2D probabilty distribution function with uncorrelated variables, and use your routine to caculate the covariance. Let's compare it to the [numpy.cov()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html) routine, which calculates the covariance matrix. How do the values compare with your expectation? How does the numpy routine calculate covariance by default (as compared to how it calculates variance)?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n=100\n",
"# Generate a set of n randomly distributed points (uniform or gaussian)\n",
"x= # code here\n",
"# generate an independent set of randomly distributed points\n",
"y= # code here\n",
"# plot them in a square plot\n",
"plt.figure()\n",
"axes=plt.gca()\n",
"axes.set_aspect('equal')\n",
"plt.plot(x,y,'ro')\n",
"# calculate and print the covariance. Compare with numpy.cov\n",
"c=cov(x,y)\n",
"print('covariance: ',c)\n",
"print('covariance matrix:')\n",
"c_matrix=np.cov(x,y)\n",
"print(c_matrix)\n",
"if not np.isclose(c,c_matrix[0,1]) : print('Problem with covariance calculation?')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's generate a perfectly correlated (or anticorrelated) data set. What do you predict for the covariance matrix in this case, again, both for a uniform distribution and a normal distribution in the variables?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" ANSWER HERE: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n=10000\n",
"# generate a randomly distributed data set in one variable\n",
"x= # code here\n",
"# set the other variable to be THE SAME (i.e., not an independent random sample), or THE SAME with opposite sign\n",
"y= # code here\n",
"plt.figure()\n",
"axes=plt.gca()\n",
"axes.set_aspect('equal')\n",
"\n",
"plt.plot(x,y,'ro')\n",
"# calculate and print the covariance matrix\n",
"c=cov(x,y)\n",
"print('covariance: ',c)\n",
"c_matrix=np.cov(x,y)\n",
"print('covariance matrix:')\n",
"print(c_matrix)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's generate a correlated data set, but not with a perfect correlation. Do this by injecting some random noise on top of the underlying correlation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_correlated_sample(n,uniform=True,xmin=0,xmax=1,mean=0,sigma=1) :\n",
" \"\"\"Generated correlated sample data set\"\"\"\n",
" if uniform :\n",
" x=np.random.uniform(xmin,xmax,size=n)\n",
" else :\n",
" x=np.random.normal(mean,sigma,size=n)\n",
" # make the other variable correlated, but add some random noise to one of the variables\n",
" y= # code here\n",
" return x,y\n",
"\n",
"n=100\n",
"x,y=get_correlated_sample(n)\n",
"plt.figure()\n",
"axes=plt.gca()\n",
"axes.set_aspect('equal')\n",
"\n",
"plt.plot(x,y,'ro')\n",
"\n",
"# calculate and print the covariance matrix\n",
"c=cov(x,y)\n",
"print('covariance: ',c)\n",
"c_matrix=np.cov(x,y)\n",
"print('covariance matrix:')\n",
"print(c_matrix)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do you expect your results for the covariance matrices change if you changed the width of the underlying distributions, both for uniform and normally distributed variables?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ANSWER HERE:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Play around with your data sets, changing the widths and types (uniform and normal) of the underlying distributions to build up some intuition about the covariance matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, now let's work the other direction, i.e., given a covariance matrix, generate a multivariate distribution. For normal distributions, we can use the [numpy.random.multivariate_normal()](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.multivariate_normal.html) routine. Play around with this for different values of the input means and covariance matrix:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n= # set value\n",
"\n",
"# The desired mean values of the sample.\n",
"xmean= # set value\n",
"ymean= # set value\n",
"mu = np.array([xmean,ymean])\n",
"\n",
"# The desired covariance matrix.\n",
"c00=\n",
"c01=\n",
"c10=\n",
"c11=\n",
"r = np.array([\n",
" [ c00, c01 ],\n",
" [ c10, c11]\n",
" ])\n",
"\n",
"# Generate the random samples.\n",
"yvar = np.random.multivariate_normal(mu, r, size=n)\n",
"plt.figure()\n",
"axes=plt.gca()\n",
"axes.set_aspect('equal')\n",
"\n",
"plt.plot(yvar[:,0],yvar[:,1],'ro')\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"OK, now let's work with quantifying correlation for distributions with arbitrary width, using the correlation coefficient, which is essentially a normalized covariance.\n",
"\n",
"[scipy.stats.pearsonr](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html) computes the (Pearson) correlation coeffient, and returns it plus the probability of getting the value given the number of points, given the null hypothesis. [scipy.stats.spearmanr](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html) does the same thing for Spearman's correlation coefficient. Let's write a quick routine that calculates both for an input data set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def corrcoef(x,y) :\n",
" \"\"\" Compute Pearsons and Spearmans correlation coefficients and probabilities\"\"\"\n",
" pearson_r, pearson_p = # code here\n",
" spearman_r, spearman_p = # code here\n",
" print('Pearsons: ', pearson_r, pearson_p)\n",
" print('Spearmans: ',spearman_r, spearman_p)\n",
" \n",
"\n",
"corrcoef(x,y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Experiment with generating sets of data with different covariance matrices and measuring the correlation coefficients"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we want to make a nonlinear relation? One way to accomplish this is to generate a random sample from a distribution of each x point, with some nonlinear relation. Play around with the relation and the scatter around it, and see the effect on the measured correlation coefficients."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}