The astroph.py Suite

Introduction

astroph.py and the accompanying files are designed to generate a freestanding, maintenance-free web site for listing arXiv.org papers (or other URLs), as seen at astronomy.nmsu.edu/agso/coffee. A PHP-enabled web site lists the submitted articles and provides an input form for submitting additional articles of interest; upon submission, the site automatically updates its own PHP code. Items can subsequently be deleted from the list using a password-protected PHP text editor.

Astroph.py was created at the University of California, Los Angeles by Ian J. Crossfield and Nathaniel Ross.

It was then heavily edited by Ryan T. Hamilton at New Mexico State University and became this distributed version, available at https://bitbucket.org/astrobokonon/astrocoffee/ (or as a tar.gz via the Download Now button above).

Re-use or modification is allowed and encouraged, so long as proper acknowledgement is made to the original authors and institution.

History

2010-02-11 IJC v0.4 First edition w/documentation
2010-02-12 IJC v0.41 Corrected "nexthursday" bug
2010-04-04 RTH v0.75 Added automatic archiving, comments via IntenseDebate, changed internal code structure
2010-04-09 RTH v0.77 Fixed arXiv date scraping bug
2010-04-12 RTH v0.85 Added ADS scraper, second edition w/documentation
2010-05-04 RTH v0.90 Fixed Nature scraper and added exception handling to keep the website going when a preprint object fails
2010-05-09 RTH v0.91 Added .htaccess file to restrict access and hopefully cut out spam submissions
2010-05-24 RTH v0.92 Added xxx.lanl.gov server checking for arXiv
2010-07-19 RTH v0.95 More exception handling to catch stray PDFs; also changed error reporting to a log file.

The guts -- How does it work?

Astroph.py invokes a few fairly basic Python routines for URL retrieval, file I/O, and string manipulation, all wrapped up in a colorful HTML/PHP candy shell -- see the source code for details. The PHP handles form submission and processing, and calls the underlying Python script, which collects data from web pages and writes new HTML/PHP code.
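As a concrete illustration of the "writes new HTML/PHP code" step, here is a minimal Python sketch. The function name, output filename, and markup are placeholders for illustration, not the actual astroph.py internals:

    def write_listing(entries, outfile="papers.php"):
        """Rewrite the listing page from (url, title) pairs scraped
        from the submitted links. Filename and markup here are
        placeholders, not the real astroph.py output format."""
        with open(outfile, "w") as f:
            f.write("<ul>\n")
            for url, title in entries:
                f.write('  <li><a href="{0}">{1}</a></li>\n'.format(url, title))
            f.write("</ul>\n")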

As currently implemented, page updates can take several minutes, because a 20-second pause is inserted between retrievals of each submission's information. This prevents the source sites from recognizing the script as a rampaging robot.
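A rough sketch of that throttling, assuming a simple sequential fetch loop (Python 3's urllib; the constant matches the 20-second delay described above):

    import time
    import urllib.request

    PAUSE_SECONDS = 20  # delay between fetches, per the note above

    def fetch_all(urls):
        """Fetch the raw HTML for each submitted URL in turn, pausing
        between requests so the source sites don't mistake the script
        for an abusive crawler."""
        pages = {}
        for i, url in enumerate(urls):
            with urllib.request.urlopen(url, timeout=30) as resp:
                pages[url] = resp.read().decode("utf-8", errors="replace")
            if i < len(urls) - 1:  # no need to pause after the last fetch
                time.sleep(PAUSE_SECONDS)
        return pages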

Though it could be accomplished with regular expressions, the scraping of information from the HTML documents uses BeautifulSoup as a parser. See the interwebs for more info, particularly this link:

http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege/
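As a taste of what that parsing looks like, here is a minimal sketch using the modern bs4 package. The tag and class name are assumptions about arXiv's abstract-page markup and may need adjusting if the page layout changes:

    from bs4 import BeautifulSoup

    def scrape_title(html):
        """Pull the paper title out of an arXiv abstract page. The
        h1/'title' selector is an assumption about arXiv's markup,
        not a guaranteed interface."""
        soup = BeautifulSoup(html, "html.parser")
        tag = soup.find("h1", class_="title")
        if tag is None:
            return "Unknown title"
        # arXiv prefixes the heading with "Title:"; strip it off.
        return tag.get_text(strip=True).replace("Title:", "", 1).strip()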

Future revisions

Anyone with a mind toward improvements is welcome to implement the following, or other, desired features. As always, keep the original authors in the loop.
  • Smarter handling of I/O, file permissions, and general web security
  • Someone with a better understanding of PHP security can probably say which is safer: keeping the username/password in plaintext with the salt hidden, or vice versa. I just don't see how to combine it all in a totally secure way at the moment
  • A locally generated article information database, so that article data don't have to be re-loaded from the web each time the script is run. This would allow a much shorter delay between processing each submission, and much quicker updating (a rough sketch of one approach follows this list)
  • Options for sorting by, e.g., web submission date, paper submission date, type, etc.
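For the local database item above, one possible approach is a simple JSON cache keyed by URL. This is a sketch only; the filename and the scrape_info helper are hypothetical:

    import json
    import os

    CACHE_FILE = "article_cache.json"  # hypothetical filename

    def load_cache():
        """Return previously scraped article data, or an empty dict."""
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                return json.load(f)
        return {}

    def save_cache(cache):
        """Write the article data back to disk for the next run."""
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f, indent=2)

    # Usage: only hit the web for submissions not already cached.
    # cache = load_cache()
    # for url in submissions:
    #     if url not in cache:
    #         cache[url] = scrape_info(url)  # hypothetical scraper
    # save_cache(cache)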