{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " SELECT columnnames\n", " FROM tablename\n", " WHERE conditions\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from astroquery.sdss import SDSS\n", "sql=\" \n", " \"\n", "print(sql)\n", "dr17=SDSS.query_sql(sql, data_release=17) \n", "print(len(dr17))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many objects did you get?\n", "
\n", "For your scatter plot, choose the smallest point sizes (s= keyword with plt.scatter)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gd=np.where((dr17['FE_H']>-900) & (dr17['MG_FE']>-900) & np.isfinite(dr17['FE_H']*dr17['MG_FE']))[0]\n", "dr17=dr17[gd]\n", "print(len(dr17))\n", "\n", "# use a point size of 1 for the scatter plot\n", "plt.scatter(...) # fill in plot quantities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that it's hard to see structure, even with the smallest point sizes. So let's use density estimation to display the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start with a Gaussian KDE using scikit-learn routine: see http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity. As with many routines that we'll use, this wants an input array of dimesions [npts, ndim], where ndim is the number dimensions of the data. For us, we'll use two dimensions (FE_H and MG_FE), so want a [npts,2] array.\n", "
\n",
"As you will see, working with so many data points can take a non-negligible amount of time, especially on a laptop (always consider using appropriate hardware when working with large data sets - your laptop may not be the best choice!). To help with this, we'll set a sampling variable, nsamp, and we'll just take every nsamp'th point in the cells below: you can set nsamp=1 to use every point, but you might want to start with smaller samples, eg. nsamp=10, so you're not waiting too long!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nsamp=10 # value for sampling of data point, to speed things up!\n",
"\n",
"# put the data into a [npts,2] array, sampling every nsamp point\n",
"X=np.array([dr17['FE_H'][::nsamp],dr17['MG_FE'][::nsamp]]).T\n",
"print(X.shape)\n",
"\n",
"# plot it\n",
"plt.scatter(X[:,0],X[:,1],s=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, let's set up the KDE . You need to specify a kernel type, and a bandwidth"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# do the KDE\n",
"from sklearn.neighbors import KernelDensity\n",
"kernel= # choose kernel shape\n",
"bandwidth = # choose bandwidth\n",
"kde=KernelDensity(kernel=kernel,bandwidth=bandwidth)\n",
"\n",
"# do the fit\n",
"kde.fit(X)\n",
"\n",
"# sample it on a grid as specified by these x and y values\n",
"x=np.linspace(-2.5,1,100)\n",
"y=np.linspace(-0.6,0.8,100)\n",
"\n",
"# meshgrid returns two 2D arrays with the values\n",
"yg,xg=np.meshgrid(x,y)\n",
"print('xg shape: ',xg.shape)\n",
"print('yg shape: ',yg.shape)\n",
"# ravel turns the 2D arrays into the long lists of (npts,2) that we need\n",
"xy=np.vstack([yg.ravel(),xg.ravel()]).T\n",
"print('xy:',xy.shape)\n",
"\n",
"# get the KDE samples at the grid locations and reshape it back to 100x100\n",
"dens=kde.score_samples(xy).reshape(xg.shape)\n",
"print('dens shape: ',dens.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the KDE results as an image, using imshow()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.imshow(dens,origin='lower',interpolation='nearest',vmin=-5,vmax=2,\n",
" extent=(x[0],x[-1],y[0],y[-1]),aspect='equal')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Experiment with using different bandwidths and, if you want, more data points by adjusting nsamp. Describe your experiments. What bandwidth do you prefer?\n",
"\n",
"
ANSWER HERE: "
]
},
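{
"cell_type": "markdown",
"metadata": {},
"source": [
"If it helps to get started, here is a minimal sketch for comparing a few bandwidths side by side; the trial values are just illustrative assumptions, not recommended answers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: compare a few illustrative bandwidths side by side\n",
"bw_list=[0.02,0.05,0.1,0.2]  # assumed trial values -- adjust to taste\n",
"fig,ax=plt.subplots(2,2,figsize=(10,8))\n",
"for i,bw in enumerate(bw_list) :\n",
"    kde=KernelDensity(kernel='gaussian',bandwidth=bw).fit(X)\n",
"    dens=kde.score_samples(xy).reshape(xg.shape)\n",
"    ax[i//2,i%2].imshow(dens,origin='lower',interpolation='nearest',vmin=-5,vmax=2,\n",
"                        extent=(x[0],x[-1],y[0],y[-1]),aspect='equal')\n",
"    ax[i//2,i%2].set_title('bandwidth={}'.format(bw))"
]
},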
{
"cell_type": "markdown",
"metadata": {},
"source": [
"scikit-learn has some nice built-in tools. In particular, it has a generic grid search routine which can sample a range of parameters for any of a number of algorithms, and determine the best parameters using cross validation. Here, we will use cross-validation to get optimal smoothing size. See http://scikit-learn.org/stable/modules/grid_search.html#grid-search"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"# note that GridSearchCV has its own parameters for choosing cross-validation samples, etc.\n",
"# here, we set the grid of bandwidth parameters for it to try\n",
"bandwidths= # set array of bandwidths to try -- don't try too many!\n",
"grid=GridSearchCV(KernelDensity(),{'bandwidth': bandwidths})\n",
"grid.fit(X)\n",
"\n",
"# check out some of the output from the grid search\n",
"import pprint\n",
"pprint.pprint(grid.cv_results_)\n",
"print('best parameters from CV: ',grid.best_params_)\n"
]
},
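{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small usage note: with the default refit=True, GridSearchCV refits a KernelDensity on the full data with the best parameters, so grid.best_estimator_ is a ready-to-use estimator, e.g.:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the refit estimator from the grid search can be scored directly\n",
"dens_best=grid.best_estimator_.score_samples(xy).reshape(xg.shape)\n",
"print(dens_best.shape)"
]
},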
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the KDE with the \"best\" bandwidth from the cross-validation test and display"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"kde=KernelDensity(kernel='gaussian',bandwidth=grid.best_params_['bandwidth'])\n",
"kde.fit(X)\n",
"\n",
"# get the KDE samples at the grid locations\n",
"dens=kde.score_samples(xy).reshape(xg.shape)\n",
"plt.imshow(dens,origin='lower',interpolation='nearest',vmin=-5,vmax=2,\n",
" extent=(-2.5,1,-0.6,0.8),aspect='equal')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How does the cross-validation result compare with your preferred bandwidth?\n",
"
ANSWER HERE: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now try a nearest neighbors density estimation, http://www.astroml.org/modules/generated/astroML.density_estimation.KNeighborsDensity.html.\n",
"Here we will smooth each point with a width depending on a specified number of nearest neighbors. Experiment with different numbers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from astroML.density_estimation import KNeighborsDensity\n",
"\n",
"nneighbors= # number of neighbors for smoothing\n",
"knd=KNeighborsDensity('bayesian',nneighbors)\n",
"knd.fit(X)\n",
"\n",
"# again, evaluate on our grid and display as an image\n",
"dens=np.array(knd.eval(xy)).reshape(xg.shape)\n",
"dens=np.log(dens)\n",
"\n",
"plt.imshow(dens,origin='lower',interpolation='nearest',extent=(-2.5,1,-0.6,0.8),aspect='equal',\n",
" vmin=5,vmax=12)"
]
},
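{
"cell_type": "markdown",
"metadata": {},
"source": [
"As with the KDE bandwidth, it can help to look at several neighbor counts at once; a minimal sketch (the values 10, 40, and 160 are arbitrary illustrative choices):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: compare a few neighbor counts side by side\n",
"fig,ax=plt.subplots(1,3,figsize=(16,5))\n",
"for i,k in enumerate([10,40,160]) :\n",
"    knd=KNeighborsDensity('bayesian',k)\n",
"    knd.fit(X)\n",
"    dens=np.log(np.array(knd.eval(xy)).reshape(xg.shape))\n",
"    ax[i].imshow(dens,origin='lower',interpolation='nearest',extent=(-2.5,1,-0.6,0.8),\n",
"                 aspect='equal',vmin=5,vmax=12)\n",
"    ax[i].set_title('n_neighbors={}'.format(k))"
]
},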
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Describe your results and preferred values.\n",
"
ANSWER HERE: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, now let's look at a parametric estimator, Gaussian mixtures, see http://scikit-learn.org/stable/modules/mixture.html. Here we are going to try multiple numbers of components and calculate the AIC and BIC for each, for comparison."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.mixture import GaussianMixture\n",
"\n",
"ngauss= # set rray of number of gaussian components to try\n",
"\n",
"nplots=len(ngauss)\n",
"# setup plot grid\n",
"fig,ax=plt.subplots(nplots//3+1,3,figsize=(16,12))\n",
"fig.subplots_adjust(hspace=0.001,wspace=0.001)\n",
"\n",
"# initialize accumulators for AIC and BIC\n",
"aic=[]\n",
"bic=[]\n",
"\n",
"# loop over the different numbers of components\n",
"for iplot,n in enumerate(ngauss) :\n",
" gmm=GaussianMixture(n+1)\n",
" gmm.fit(X)\n",
" dens=gmm.score_samples(xy).reshape(xg.shape)\n",
" \n",
" # calculate and store AIC and BIC\n",
" print('n',n,'BIC',gmm.bic(X),'AIC',gmm.aic(X))\n",
" aic.append(gmm.aic(X))\n",
" bic.append(gmm.bic(X))\n",
"\n",
" # show the mixture model\n",
" ax[iplot//3,iplot%3].imshow(dens,origin='lower',interpolation='nearest',extent=(-2.5,1,-0.6,0.8),\n",
" aspect='equal',vmin=-5,vmax=2)\n",
" # plot the location of the component centers\n",
" ax[iplot//3,iplot%3].plot(gmm.means_[:,0],gmm.means_[:,1],'ro')\n",
" \n",
"\n",
"#plot AIC/BIC vs number of components\n",
"plt.figure()\n",
"plt.plot(ngauss,aic,label='AIC')\n",
"plt.plot(ngauss,bic,label='BIC')\n",
"plt.legend()"
]
},
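{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also read the preferred number of components off the curves programmatically (assuming ngauss holds the component counts tried above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# index of the minimum AIC/BIC gives the preferred number of components\n",
"print('AIC prefers: ',ngauss[np.argmin(aic)])\n",
"print('BIC prefers: ',ngauss[np.argmin(bic)])"
]
},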
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many components are preferred by AIC/BIC? How many components do you prefer?\n",
"
ANSWER HERE: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, now we try extreme deconvolution, which attempts to get an underlying gaussian mixture which, when convolved with uncertainties, matches the observed data. Now we have to pass a covariance matrix for each point; for simplicity here, we take a simple covariance matrix with an uncertainty of 0.02 in each coordinate. See http://www.astroml.org/user_guide/density_estimation.html#extreme-deconvolution\n",
"\n",
"
\n",
"WARNING: This can take some time!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'X' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m \n",
"Experiment with different numbers of clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import KMeans\n",
"n_clusters= # set number of clusters\n",
"clf=KMeans(n_clusters=n_clusters)\n",
"clf.fit(X)\n",
"centers=clf.cluster_centers_\n",
"#print(centers)\n",
"labels=clf.predict(X)\n",
"\n",
"#set up an array of color/point types to cycle through for different clusters\n",
"color=['ro','go','bo','co','yo','mo','bo','r^','g^','b^','c^','y^','m^','b^']\n",
"for i in range(n_clusters) :\n",
" gd=np.where(labels == i)[0]\n",
" plt.plot(X[gd,0],X[gd,1],color[i])\n",
"plt.ylim(-0.5,0.8)\n",
" "
]
},
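{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common aid for choosing the number of clusters is an \"elbow\" plot of the K means inertia (the sum of squared distances to the nearest cluster center) as a function of n_clusters; a minimal sketch, with an assumed trial range of 1-9 clusters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: inertia vs number of clusters (\"elbow\" plot)\n",
"ks=np.arange(1,10)  # assumed trial range -- adjust to taste\n",
"inertia=[KMeans(n_clusters=k).fit(X).inertia_ for k in ks]\n",
"plt.figure()\n",
"plt.plot(ks,inertia)\n",
"plt.xlabel('n_clusters')\n",
"plt.ylabel('inertia')"
]
},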
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do you think about the results?\n",
" \n",
"WARNING: can take a long time!"
]
},
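{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a starting point for the bandwidth, scikit-learn provides sklearn.cluster.estimate_bandwidth, which bases an estimate on nearest-neighbor distances (the quantile value below is just an illustrative choice):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import estimate_bandwidth\n",
"# quantile controls how local the estimate is; 0.2 is an arbitrary starting choice\n",
"print(estimate_bandwidth(X,quantile=0.2))"
]
},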
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import MeanShift\n",
"ms = MeanShift(bandwidth= ) # set bandwidth\n",
"ms.fit(X)\n",
"centers=ms.cluster_centers_\n",
"labels=ms.labels_\n",
"print(len(labels))\n",
"print(len(centers))\n",
"for i in range(len(centers)) :\n",
" gd = np.where(labels == i)[0]\n",
" plt.scatter(X[gd,0],X[gd,1],color[i%14])\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" What do you think about the results?\n",
"
ANSWER HERE: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 4
}