The astroph.py Suite
Introduction
astroph.py and the accompanying files are designed to generate a freestanding, maintenance-free web site for listing arXiv.org files (or other URLs), as seen at astronomy.nmsu.edu/agso/coffee. A PHP-enabled web site lists submitted articles and provides an input form for submitting additional articles of interest; upon submission, the site automatically updates the PHP code. Items can subsequently be deleted from the list using a password-protected PHP text editor.
Astroph.py was created at UC Los Angeles by Ian J. Crossfield and Nathaniel Ross.
It was then heavily edited by Ryan T. Hamilton at New Mexico State University, and became this distributed version. It is available at https://bitbucket.org/astrobokonon/astrocoffee/, or as a tar.gz from the Download Now button above.
Re-use or modification is allowed and encouraged, so long as proper acknowledgement is made to the original authors and institution.
History
2010-02-11 | IJC | v0.4 | First edition w/documentation |
2010-02-12 | IJC | v0.41 | Corrected "nexthursday" bug |
2010-04-04 | RTH | v0.75 | Added automatic archiving, comments via IntenseDebate, changed internal code structure |
2010-04-09 | RTH | v0.77 | Fixed arXiv date scraping bug |
2010-04-12 | RTH | v0.85 | Added ADS scraper, second edition w/documentation |
2010-05-04 | RTH | v0.90 | Fixed Nature scraper and added exception handling to keep the website going when a preprint object fails |
2010-05-09 | RTH | v0.91 | Added .htaccess file to restrict access and hopefully cut out spam submissions |
2010-05-24 | RTH | v0.92 | Added xxx.lanl.gov server checking for arXiv |
2010-07-19 | RTH | v0.95 | More exception handling to catch stray PDFs; error reporting now goes to a log file |
The guts -- How does it work?
Astroph.py invokes a few fairly basic Python commands for URL retrieval, file I/O, and string manipulation, all wrapped up in a colorful HTML/PHP candy shell -- see the source code for more details. The PHP handles form submission and processing, and calls the underlying Python script that collects data from web pages and writes new HTML/PHP code.
As currently implemented, page updates can take several minutes, because a 20-second pause is inserted between retrieval of each submission's info. This is to prevent sites from recognizing the script as a rampaging robot.
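The pause between retrievals can be sketched roughly as follows. This is an illustration only, not the code in astroph.py: the names fetch_all, PAUSE_SECONDS, and the fetch/sleep parameters are hypothetical, and the try/except simply mirrors the "keep the website going" behaviour noted in the history.

```python
import time
import urllib.request

PAUSE_SECONDS = 20  # pause length quoted in the text; the name is hypothetical


def fetch_all(urls, fetch=None, pause=PAUSE_SECONDS, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    if fetch is None:
        # Default fetcher: plain HTTP GET with a timeout.
        fetch = lambda url: urllib.request.urlopen(url, timeout=30).read()
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(pause)  # wait only between requests, not before the first
        try:
            pages[url] = fetch(url)
        except Exception as err:
            # One failed submission should not stop the whole page update.
            pages[url] = None
            print("could not retrieve %s: %s" % (url, err))
    return pages
```

With a pause like this, a list of N submissions needs at least 20 x (N - 1) seconds per update, which is where the several-minute update time comes from.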
Though it could be accomplished with regular expressions, the scraping of information from the html documents uses BeautifulSoup as a parser. See the interwebs for more info, particularly this link:
http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege/
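As a rough illustration of parser-based scraping, here is a minimal sketch that pulls the title out of an HTML page. The script itself uses BeautifulSoup; this sketch swaps in Python's standard-library html.parser so that it runs without any extra install, and the names TitleParser and extract_title are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attributes are ignored here.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several pieces, so gather them all.
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Join the pieces and collapse runs of whitespace and newlines.
        return " ".join("".join(self._chunks).split())


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

The parser tracks where it is in the document, so attributes on the tag, line breaks inside the title, and sloppy markup elsewhere on the page do not need special cases the way they would in a regular expression.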
Future revisions
Anyone with a mind toward improvements is welcome to attempt implementation of the following, or other, desired future features. As always, keep the original authors in the loop.
- Smarter handling of IO, file permissions, and general web security
- Someone with a better understanding of PHP security can probably tell me which is better: to have the username/password in plaintext and the salt hidden, or vice versa. I just don't see how to combine it all in a totally secure way at the moment.
- A locally generated article information database, so that articles' data don't have to be re-loaded from the web each time the script is run. Doing this would allow a much shorter delay between processing each submission, and much quicker updating.
- Options for sorting by, e.g., web submission date, paper submission date, type, etc.
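The local article database suggested in the list above could look something like this minimal sketch. Nothing like it exists in the distributed code; the class name ArticleCache, the table layout, and the scrape callback are all assumptions made for illustration.

```python
import json
import sqlite3


class ArticleCache:
    """Store scraped article info locally, keyed by URL (hypothetical design)."""

    def __init__(self, path=":memory:"):
        # Pass a file path to keep the cache between runs of the script.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS articles "
            "(url TEXT PRIMARY KEY, info TEXT NOT NULL)"
        )

    def get(self, url, scrape):
        """Return cached info for url, calling scrape(url) only on a miss."""
        row = self.db.execute(
            "SELECT info FROM articles WHERE url = ?", (url,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        info = scrape(url)  # the slow web retrieval happens only here
        self.db.execute(
            "INSERT INTO articles (url, info) VALUES (?, ?)",
            (url, json.dumps(info)),
        )
        self.db.commit()
        return info
```

Only articles missing from the cache would trigger a web request, so the pause between retrievals would apply to new submissions rather than to the whole list.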