{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Density estimation: nonparametric and parametric" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's work with the APOGEE DR17 data set, looking in 2D at [Mg/Fe] vs [Fe/H]. First we will download data from the SDSS Catalog Archive Server using an SQL query with astroquery.\n", "
\n", "To learn about what is in the data base, you can peruse the \n", " schema browser . We will be interested in quantities from the aspcapStar table (select that from the Tables item on the right to browse the contents), specifically FE_H (which gives the logarithm of the abundance of iron to hydrogen relative to that of the Sun), MG_FE (logarithmic abundance of magnesium relative to irorn relative to that ratio in the Sun), and LOGG (surface gravity). Construct a query that selects FE_H and MG_FE from this table for object with 1 < logg < 2. Remember, the basic structure of an SQL query:\n", "
\n",
    "   SELECT  columnnames\n",
    "   FROM  tablename\n",
    "   WHERE conditions\n",
    "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from astroquery.sdss import SDSS\n", "sql=\" \n", " \"\n", "print(sql)\n", "dr17=SDSS.query_sql(sql, data_release=17) \n", "print(len(dr17))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many objects did you get?\n", "
ANSWER HERE:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, let's make a scatter plot of FE_H vs MG_FE. Note that bad values are filled with -999, so let's clean the data by restricting our data to those with good values. Let's also make sure we don't have any NaN or inf. Note that we could have included these constraints in the SQL query!\n", "

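\n",
    "For example, the WHERE clause of the query could have included additional conditions along the lines of (a sketch): AND fe_h > -900 AND mg_fe > -900\n",
    "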
\n", "For your scatter plot, choose the smallest point sizes (s= keyword with plt.scatter)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gd=np.where((dr17['FE_H']>-900) & (dr17['MG_FE']>-900) & np.isfinite(dr17['FE_H']*dr17['MG_FE']))[0]\n", "dr17=dr17[gd]\n", "print(len(dr17))\n", "\n", "# use a point size of 1 for the scatter plot\n", "plt.scatter(...) # fill in plot quantities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that it's hard to see structure, even with the smallest point sizes. So let's use density estimation to display the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start with a Gaussian KDE using scikit-learn routine: see http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity. As with many routines that we'll use, this wants an input array of dimesions [npts, ndim], where ndim is the number dimensions of the data. For us, we'll use two dimensions (FE_H and MG_FE), so want a [npts,2] array.\n", "

\n", "As you will see, working with so many data points can take a non-negligible amount of time, especially on a laptop (always consider using appropriate hardware when working with large data sets - your laptop may not be the best choice!). To help with this, we'll set a sampling variable, nsamp, and we'll just take every nsamp'th point in the cells below: you can set nsamp=1 to use every point, but you might want to start with smaller samples, eg. nsamp=10, so you're not waiting too long!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nsamp=10 # value for sampling of data point, to speed things up!\n", "\n", "# put the data into a [npts,2] array, sampling every nsamp point\n", "X=np.array([dr17['FE_H'][::nsamp],dr17['MG_FE'][::nsamp]]).T\n", "print(X.shape)\n", "\n", "# plot it\n", "plt.scatter(X[:,0],X[:,1],s=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, let's set up the KDE . You need to specify a kernel type, and a bandwidth" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# do the KDE\n", "from sklearn.neighbors import KernelDensity\n", "kernel= # choose kernel shape\n", "bandwidth = # choose bandwidth\n", "kde=KernelDensity(kernel=kernel,bandwidth=bandwidth)\n", "\n", "# do the fit\n", "kde.fit(X)\n", "\n", "# sample it on a grid as specified by these x and y values\n", "x=np.linspace(-2.5,1,100)\n", "y=np.linspace(-0.6,0.8,100)\n", "\n", "# meshgrid returns two 2D arrays with the values\n", "yg,xg=np.meshgrid(x,y)\n", "print('xg shape: ',xg.shape)\n", "print('yg shape: ',yg.shape)\n", "# ravel turns the 2D arrays into the long lists of (npts,2) that we need\n", "xy=np.vstack([yg.ravel(),xg.ravel()]).T\n", "print('xy:',xy.shape)\n", "\n", "# get the KDE samples at the grid locations and reshape it back to 100x100\n", "dens=kde.score_samples(xy).reshape(xg.shape)\n", "print('dens shape: ',dens.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the KDE results as an image, using imshow()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.imshow(dens,origin='lower',interpolation='nearest',vmin=-5,vmax=2,\n", " extent=(x[0],x[-1],y[0],y[-1]),aspect='equal')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Experiment with using different bandwidths and, if you want, more data points by adjusting nsamp. Describe your experiments. What bandwidth do you prefer?\n", "\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "scikit-learn has some nice built-in tools. In particular, it has a generic grid search routine which can sample a range of parameters for any of a number of algorithms, and determine the best parameters using cross validation. Here, we will use cross-validation to get optimal smoothing size. See http://scikit-learn.org/stable/modules/grid_search.html#grid-search" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "# note that GridSearchCV has its own parameters for choosing cross-validation samples, etc.\n", "# here, we set the grid of bandwidth parameters for it to try\n", "bandwidths= # set array of bandwidths to try -- don't try too many!\n", "grid=GridSearchCV(KernelDensity(),{'bandwidth': bandwidths})\n", "grid.fit(X)\n", "\n", "# check out some of the output from the grid search\n", "import pprint\n", "pprint.pprint(grid.cv_results_)\n", "print('best parameters from CV: ',grid.best_params_)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the KDE with the \"best\" bandwidth from the cross-validation test and display" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kde=KernelDensity(kernel='gaussian',bandwidth=grid.best_params_['bandwidth'])\n", "kde.fit(X)\n", "\n", "# get the KDE samples at the grid locations\n", "dens=kde.score_samples(xy).reshape(xg.shape)\n", "plt.imshow(dens,origin='lower',interpolation='nearest',vmin=-5,vmax=2,\n", " extent=(-2.5,1,-0.6,0.8),aspect='equal')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does the cross-validation result compare with your preferred bandwidth?\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now try a nearest neighbors density estimation, http://www.astroml.org/modules/generated/astroML.density_estimation.KNeighborsDensity.html.\n", "Here we will smooth each point with a width depending on a specified number of nearest neighbors. Experiment with different numbers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from astroML.density_estimation import KNeighborsDensity\n", "\n", "nneighbors= # number of neighbors for smoothing\n", "knd=KNeighborsDensity('bayesian',nneighbors)\n", "knd.fit(X)\n", "\n", "# again, evaluate on our grid and display as an image\n", "dens=np.array(knd.eval(xy)).reshape(xg.shape)\n", "dens=np.log(dens)\n", "\n", "plt.imshow(dens,origin='lower',interpolation='nearest',extent=(-2.5,1,-0.6,0.8),aspect='equal',\n", " vmin=5,vmax=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Describe your results and preferred values.\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now let's look at a parametric estimator, Gaussian mixtures, see http://scikit-learn.org/stable/modules/mixture.html. Here we are going to try multiple numbers of components and calculate the AIC and BIC for each, for comparison." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.mixture import GaussianMixture\n", "\n", "ngauss= # set rray of number of gaussian components to try\n", "\n", "nplots=len(ngauss)\n", "# setup plot grid\n", "fig,ax=plt.subplots(nplots//3+1,3,figsize=(16,12))\n", "fig.subplots_adjust(hspace=0.001,wspace=0.001)\n", "\n", "# initialize accumulators for AIC and BIC\n", "aic=[]\n", "bic=[]\n", "\n", "# loop over the different numbers of components\n", "for iplot,n in enumerate(ngauss) :\n", " gmm=GaussianMixture(n+1)\n", " gmm.fit(X)\n", " dens=gmm.score_samples(xy).reshape(xg.shape)\n", " \n", " # calculate and store AIC and BIC\n", " print('n',n,'BIC',gmm.bic(X),'AIC',gmm.aic(X))\n", " aic.append(gmm.aic(X))\n", " bic.append(gmm.bic(X))\n", "\n", " # show the mixture model\n", " ax[iplot//3,iplot%3].imshow(dens,origin='lower',interpolation='nearest',extent=(-2.5,1,-0.6,0.8),\n", " aspect='equal',vmin=-5,vmax=2)\n", " # plot the location of the component centers\n", " ax[iplot//3,iplot%3].plot(gmm.means_[:,0],gmm.means_[:,1],'ro')\n", " \n", "\n", "#plot AIC/BIC vs number of components\n", "plt.figure()\n", "plt.plot(ngauss,aic,label='AIC')\n", "plt.plot(ngauss,bic,label='BIC')\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many components are preferred by AIC/BIC? How many components do you prefer?\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now we try extreme deconvolution, which attempts to get an underlying gaussian mixture which, when convolved with uncertainties, matches the observed data. Now we have to pass a covariance matrix for each point; for simplicity here, we take a simple covariance matrix with an uncertainty of 0.02 in each coordinate. See http://www.astroml.org/user_guide/density_estimation.html#extreme-deconvolution\n", "\n", "

\n", "WARNING: This can take some time!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [ { "ename": "NameError", "evalue": "name 'X' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mn_components\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mxdgmm\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mXDGMM\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn_components\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn_components\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mnobs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;31m# set covariance matrix for uncertainties\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mNameError\u001b[0m: name 'X' is not defined" ] } ], "source": [ "from astroML.density_estimation import XDGMM\n", "\n", "# set number of components\n", "n_components=3\n", "xdgmm = XDGMM(n_components=n_components)\n", "nobs=X.shape[0]\n", "\n", "# set covariance matrix for uncertainties\n", "Xerr = np.zeros([nobs,2,2])\n", "Xerr[:] = np.diag([0.02,0.02])\n", "xdgmm.fit(X, Xerr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the results, both as samples drawn from the extreme deconvolution model, as well as the model ellipses themselves." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data on top\n", "fig,ax=plt.subplots(2,1,sharex=True,sharey=True)\n", "ax[0].scatter(X[:,0],X[:,1],s=2)\n", "ax[0].set_xlim(-2.5,1)\n", "ax[0].set_ylim(-0.3,0.5)\n", "\n", "# sammpling from mmodel\n", "dens=xdgmm.sample(nobs)\n", "ax[1].scatter(dens[:,0],dens[:,1],s=2)\n", "\n", "from astroML.plotting.tools import draw_ellipse\n", "for i in range(n_components) :\n", " draw_ellipse(xdgmm.mu[i],xdgmm.V[i],scales=[2],ax=ax[1],ec='k',fc='gray',alpha=0.4)\n", "\n", "print(xdgmm.mu)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think about the results:\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now let's try some clustering routines, specifically, K means. Remember, for K means you need to specify the number of clusters for it to separate object into. We will use the sklean.cluster.KMeans() routine.\n", "

\n", "Experiment with different numbers of clusters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "n_clusters= # set number of clusters\n", "clf=KMeans(n_clusters=n_clusters)\n", "clf.fit(X)\n", "centers=clf.cluster_centers_\n", "#print(centers)\n", "labels=clf.predict(X)\n", "\n", "#set up an array of color/point types to cycle through for different clusters\n", "color=['ro','go','bo','co','yo','mo','bo','r^','g^','b^','c^','y^','m^','b^']\n", "for i in range(n_clusters) :\n", " gd=np.where(labels == i)[0]\n", " plt.plot(X[gd,0],X[gd,1],color[i])\n", "plt.ylim(-0.5,0.8)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think about the results?\n", "
ANSWER HERE: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try the MeanShift routine. Here it will choose the number of clusters, but you have to choose the bandwidth, which will affect the results.\n", "

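\n",
    "If you want a data-driven starting point, scikit-learn also provides an estimate_bandwidth helper; a minimal sketch (the quantile value here is an arbitrary choice to experiment with):\n",
    "\n",
    "   from sklearn.cluster import estimate_bandwidth\n",
    "   bw = estimate_bandwidth(X, quantile=0.2)\n",
    "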
\n", "WARNING: can take a long time!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import MeanShift\n", "ms = MeanShift(bandwidth= ) # set bandwidth\n", "ms.fit(X)\n", "centers=ms.cluster_centers_\n", "labels=ms.labels_\n", "print(len(labels))\n", "print(len(centers))\n", "for i in range(len(centers)) :\n", " gd = np.where(labels == i)[0]\n", " plt.scatter(X[gd,0],X[gd,1],color[i%14])\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " What do you think about the results?\n", "
ANSWER HERE: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 4 }