{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Dimensionality reduction: Principal component analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start by loading a multi-dimensional data set from the file ndim.txt using numpy.loadtxt()." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X=np.loadtxt('ndim.txt')\n", "print(X.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset has 100 data points, which is the first index. The second index gives the number of characteristics per \n", "data point, i.e. the dimensionality of the data. How many dimensions does this data have?\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Investigate this data set by making some plots. Do you think that it is a good candidate for dimensionality reduction? Why or why not? \n", "
ANSWER HERE: \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, let's see what principal component analysis tells us. We will use the sklearn.decomposition.PCA() routine. Start by instantiating a PCA object. If you don't specify a number of components, it will do a full decomposition into the same number of projected components as input components, and we can then investigate how many of these might be needed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "pca= #instatiate a PCA object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do the PCA decomposition, we can invoke the fit() method, which will just load the decomposition into the instantiated object. To transform the original data into the new coordinate system, you can use the transform() method, which will return a new data array with coordinates in the transfomed system. You can do both of these steps together using the fit_transform() method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_projected= # do the PCA decomposition and get the transformed coordinates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The inverse_transform() method will take the transformed coordinates and transform them back to the original coordinate system. If you use all of the transformed dimensions, you should get back exactly what you started with. However, if you use fewer dimensions, you will not necessarily get the data back, depending on the degree of correlations in the system. Investigate how changing the number of components affects your ability to reproduce the original data. You can do this either by introducing the n_components keyword in the PCA() instantiation, or by setting the higher dimensions of the projected coordinates to zero, because these are sorted by their contribution to the variance (higher eigenvalues first). Then plot the difference between the original data and the inverse transformed data, paying attention to the values on the y-axis.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import copy\n", "# if you want, you can set some of the components to zero before the inverse transform, or you can rerun\n", "# above by setting n_commponents in the PCA() instatiation\n", "tmp=copy.copy(X_projected)\n", "tmp[:, ...] = 0.\n", "X_inv = # get the inverse transformed coordinates\n", "\n", "for indata,outdata in zip(X,X_inv) :\n", " plt.plot(indata-outdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many dimensions do you need to be able to reproduce the input data>\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can get at the same thing using the variances computed by the PCA routine. The attribute explained_variance_ contains the variances for each component. The attribute explained_variance_ratio_ expresses these as a fraction of the total variance. Plotting either of these will give you a \"scree plot.\"\n", "
\n", "Doing a cumulative sum on explained_variance_ratio_ will show the cumulative fraction of how much of the total variance is explained as each component is added.\n", "
\n", "Create both plots" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(pca.components_.shape)\n", "plt.figure()\n", "plt.plot(...) # plot the explained_variance_\n", "plt.figure()\n", "plt.plot(...) # plot the explained_variance_ratio\n", "plt.figure()\n", "plt.plot(...) # plot the cumulative sum of explained_variance_ratio_\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on these plots, how many dimensions are needed to characterize the data?\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can look at the eigenvectors themselves, which are stored in the components_ attribute. Plot these for the components that explain the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(pca.components_.shape)\n", "for i in range(3) :\n", " plt.plot(...,label='component: {:d}'.format(i)) # plot the ith component\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you describe what these eigenvectors represent?\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that you can also turn things around and ask the PCA routine to tell you how many components are needed to account for a desired fraction of the total variance, by specifying this fraction in the PCA instatiation with the desired_fraction= keyword (note in later version of scikit-learn, this might be specified using n_components if value is between 0 and 1). The number of required components will be saved in the n_components_ attribute after the fit() is done." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# turn things around to have PCA tell us the number of components needed to fit an input\n", "# fraction of the total variance\n", "desired_fraction = # set desired_fraction here\n", "pca= # instatiate the pca object using the desired_fraction keyword\n", "X_transformmed=) # do the PCA fit\n", "print('number of components needed: ', pca.n_components_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does this match what you got from your plots of variances?\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now let's try an actual data set.\n", "
\n", "Let's work with the APOGEE DR17 data set, looking at abundances of multiple elements. First we will download data from the SDSS Catalog archive server using an SQL query with astroquery.\n", "
\n", "To learn about what is in the data base, you can peruse the \n", " schema browser . We will be interested in quantities from the aspcapStar table (select that from the Tables item on the right), specifically FE_H (which gives the logarithm of the abundance of iron to hydrogen relative to that of the Sun), MG_FE (logarithmic abundance of magnesium relative to irorn relative to that ratio in the Sun), O_FE, SI_FE, AL_FE, MN_FE, NI_FE, and LOGG (surface gravity). Construct a query that selects the abundances from this table for object with 1 < logg < 2. Remember, the basic structure of an SQL query:\n", "
\n",
    "   SELECT  columnnames\n",
    "   FROM  tablename\n",
    "   WHERE conditions\n",
    "
\n", "To make sure all of the abundances are valid, required that each abundance be larger than -10 (bad values are flagged as -999). You can also select on apogeeStar.extratarg = 0 as we did last time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from astroquery.sdss import SDSS\n", "sql=\" \n", " \"\n", "\n", "print(sql)\n", "dr17=SDSS.query_sql(sql, data_release=17) \n", "print(len(dr17))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many objects did you get?\n", "
ANSWER HERE:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The basic question is whether there are significant correlations between the different elemental abundances. If so, you can represent the data with fewer quantities. We will use PCA to investigate this. First we need to bundle the data into a [nobjects,nabun] shaped array. You can use np.array() or np.vstack() to bundle the data, e.g." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(np.vstack([dr17['FE_H'],dr17['O_FE']]).T.shape)\n", "print(np.array([dr17['FE_H'],dr17['O_FE']]).T.shape)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a data array including all seven dimensions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X= # create data array\n", "\n", "# set up some labels for plots\n", "labels=['[Fe/H]','[O/Fe]','[Mg/Fe]','[Si/Fe]','[Al/Fe]','[Mn/Fe]','[Ni/Fe]']\n", "print(X.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, try to do a PCA decomposition as above" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#do the PCA decomposition\n", "pca= #instatiate a PCA object\n", "X_projected= # do the decomposition and get the transformed coordinates\n", "X_inv= # get the inverse transformation\n", "print(X.shape,X_projected.shape,X_inv.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make the scree plots as above, with \n", "the explained_variance_, explained_variance_ratio_, and cumulative sum of explained_variance_ratio_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(pca.components_.shape)\n", "# make the plots\n", "plt.figure()\n", "plt.plot(...)\n", "plt.figure()\n", "plt.plot(...)\n", "plt.figure()\n", "plt.plot(...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many components are needed to epxlain the bulk (say 95%) of the total variance?\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As above, look at the first few eigenvectors (as many as were needed to explain the bulk of the variance) and see if you can interpret anything. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# look at the individual components and the fraction of the variance that they explain\n", "print(pca.components_.shape)\n", "for i in range(...) : # how many components were needed?\n", " plt.plot(...,label='component {:d}'.format(i)) # plot the ith component\n", "plt.legend()\n", " \n", "plt.xticks(range(len(labels)),labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are there some elemental abundances that go in lockstep with others?\n", "
\n", " ANSWER HERE: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 1 }