Erin Hengel

Requests-Raven

Requests-Raven is a custom Requests class to log onto Raven, the University of Cambridge's central web authentication service. Requests-Raven includes subclasses to remotely access bibliographic information and PDFs stored on JSTOR, EBSCOhost and Wiley.

Installation

Install Requests-Raven with pip (probably as root):

$ pip install requests_raven

To install from source, download the latest version on GitHub.com and run the following command:

$ python setup.py install

If you are on a Mac, Python 2.7 is pre-installed by default; upgrade at python.org and follow the instructions in the Installation section of Textatistic to update the python command link.

Quickstart

The Raven class logs onto Raven and establishes a connection with the host. The session attribute returns a Requests Session object with all the methods of the main Requests API. For example, to establish a Raven connection object for the website qje.oxfordjournals.org:

>>> from requests_raven import Raven
>>> deets = {'userid': 'ab123', 'pwd': 'XXXX'}
>>> conn = Raven(url='http://qje.oxfordjournals.org', login=deets)

Then use session to access Requests methods.

>>> url = '{}/content/130/4/1623.full'.format(conn.url)
>>> request = conn.session.get(url)
>>> request.status_code
200

request.text contains the HTML code of the page you requested—which you can parse using, e.g., the Python module Beautiful Soup.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(request.text, 'html.parser')
>>> soup.title
<title>Behavioral Hazard in Health Insurance </title>

Accessing JSTOR, EBSCOhost and Wiley

JSTOR, EBSCOhost and Wiley are Raven subclasses specifically tailored to log onto jstor.org, ebscohost.com and onlinelibrary.wiley.com, respectively. Each subclass includes three methods: html fetches the HTML text, pdf downloads the PDF and ref returns bibliographic information for a given document identifier. For example, to establish a Raven connection object to JSTOR's database:

>>> from requests_raven import JSTOR
>>> conn = JSTOR(login=deets)

Fetch the HTML of the webpage devoted to the article identified by 10.1086/682574 in JSTOR's database:

>>> doc_id = '10.1086/682574'
>>> html = conn.html(id=doc_id)
>>> html[239:400]
'Per Krusell, Anthony A. Smith Jr., Is Piketty’s “Second Law of Capitalism” Fundamental?, Journal of Political Economy, Vol. 123, No. 4 (August 2015), pp. 725-748'

Download the document's PDF and bibliographic information.

>>> pdf = conn.pdf(id=doc_id, file='article.pdf')
>>> biblio = conn.ref(id=doc_id)
>>> biblio['authors']
[{'name': 'Per Krusell'}, {'name': 'Anthony A. Smith'}]

Raven

Establishing a connection to the host URL is very simple. For example, use the Raven class to access the Quarterly Journal of Economics on Oxford Journals. First, import Raven from the requests_raven module.

>>> from requests_raven import Raven

Next, establish a connection object, using the keyword argument url to point to the restricted site—qje.oxfordjournals.org—and login to supply a dictionary containing userid and pwd: userid refers to your CRSid and pwd to your Raven password. Python prompts you if either is omitted.

>>> deets = {'userid': 'ab123', 'pwd': 'XXXX'}
>>> conn = Raven(url='http://qje.oxfordjournals.org', login=deets)

conn exposes two attributes. The first is the host's url, although it won't look anything like the URL you originally supplied. It should look something like the following:

>>> conn.url
'http://libsta28.lib.cam.ac.uk:2924'

This is an EZproxy server: it gives you remote access to restricted websites by making it appear as though you're using a university library computer. Note that the server you get may look very different from this one, and the port number in particular is unlikely to be the same. From now on, always use conn.url instead of the original URL.

Requests Session object

The second attribute in conn is session. session references the original Requests Session object created by Raven—and therefore all of its methods. For example, use its get method to request the Oxford Journals webpage for "Behavioral Hazard in Health Insurance"; the response's status_code attribute holds the HTTP response code.

>>> proxy_url = '{}/content/130/4/1623.full'.format(conn.url)
>>> request = conn.session.get(proxy_url)
>>> request.status_code
200

If you know an article's DOI, you can pass it as a parameter to Oxford Journals' search function.

>>> proxy_url = '{}/search'.format(conn.url)
>>> payload = {'submit': 'yes', 'doi': '10.1093/qje/qjv029'}
>>> request = conn.session.get(proxy_url, params=payload)

request contains a list of possible matches; but given you've specified a (unique) DOI, the list should contain only one item. To find a link to the article's PDF, parse the HTML in the response's text attribute with BeautifulSoup:

>>> soup = BeautifulSoup(request.text, 'html.parser')
>>> rel_link = soup.find(attrs={'rel': 'full-text.pdf'})['href']
>>> rel_link
'/content/130/4/1623.full.pdf+html?sid=059ca79d-20f2-45d5-b5ad-d8789e235110'

Sometimes rel_link contains an absolute link; that's not the case here, so we have to build the full link ourselves.

>>> link = conn.url + rel_link
>>> link
'http://libsta28.lib.cam.ac.uk:2924/content/130/4/1623.full.pdf+html?sid=059ca79d-20f2-45d5-b5ad-d8789e235110'
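
One way to handle both cases uniformly is urllib.parse.urljoin from the standard library, which resolves a relative link against a base URL but leaves an absolute link untouched. A minimal sketch, with illustrative URLs:

```python
from urllib.parse import urljoin

# Base proxy URL, as returned by conn.url (illustrative).
base = 'http://libsta28.lib.cam.ac.uk:2924'

# A relative link is resolved against the proxy base...
relative = urljoin(base, '/content/130/4/1623.full.pdf+html')
# 'http://libsta28.lib.cam.ac.uk:2924/content/130/4/1623.full.pdf+html'

# ...while an absolute link passes through unchanged.
absolute = urljoin(base, 'http://example.com/article.pdf')
# 'http://example.com/article.pdf'
```

This way the same line of code works whether or not the href you scraped is relative.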

This link actually directs you to a split screen: the PDF on the left and metadata and other content related to the article on the right. To get just the PDF, replace pdf+html in the URL with pdf and use session to make another GET request.

>>> pdf_link = link.replace('pdf+html', 'pdf')
>>> pdf_request = conn.session.get(pdf_link)

Finally, to download the PDF to your local disk, save the content in pdf_request as a binary file.

>>> fh = open('article.pdf', 'wb')
>>> fh.write(pdf_request.content)
>>> fh.close()
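
Equivalently, a with statement guarantees the file is closed even if the write fails. A small sketch with stand-in bytes (pdf_bytes and the temporary path are illustrative; in practice you would write pdf_request.content):

```python
import os
import tempfile

# Stand-in for pdf_request.content, the PDF bytes from the GET request.
pdf_bytes = b'%PDF-1.4 example content'

# The with statement closes the file automatically, even on error.
path = os.path.join(tempfile.gettempdir(), 'article.pdf')
with open(path, 'wb') as fh:
    fh.write(pdf_bytes)
```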

You can get a lot of article meta-data this way. Peruse each website's HTML to figure out exactly how to extract the information you need.
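
As a sketch of that kind of scraping, citation details are often exposed in meta tags; the standard library's html.parser can collect them without any third-party dependency. The sample HTML and tag names below are illustrative, not taken from any particular publisher:

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            attrs = dict(attrs)
            if 'name' in attrs and 'content' in attrs:
                self.meta[attrs['name']] = attrs['content']

# Illustrative page head; real tag names vary by publisher.
sample = ('<html><head>'
          '<meta name="citation_title" content="Behavioral Hazard in Health Insurance">'
          '<meta name="citation_volume" content="130">'
          '</head></html>')

parser = MetaCollector()
parser.feed(sample)
print(parser.meta)
```

BeautifulSoup's find_all('meta') does the same job more conveniently if you already have it installed.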

JSTOR

JSTOR is a special Raven class to connect to JSTOR's database. It comes pre-packaged with several methods for downloading article PDFs and accessing bibliographic data.

To use any of the JSTOR-specific methods, first establish a connection object using JSTOR:

>>> from requests_raven import JSTOR
>>> conn = JSTOR(login=deets)

Download PDFs

To download the PDF of a particular article, you'll need the article's JSTOR id—i.e., everything after the word "stable" in its stable URL. The stable URL generally appears on the first page of a PDF downloaded from JSTOR or at the top of the webpage dedicated to it. For example, the stable URL of the article "Superstar CEOs" in the November 2009 issue of the Quarterly Journal of Economics is http://www.jstor.org/stable/40506267, so the document id is 40506267.

To fetch the PDF, use the pdf method and indicate the JSTOR document identifier using the id keyword argument. If you supply a file name using the file keyword argument, the PDF is automatically saved for you.

>>> pdf = conn.pdf(id=40506267, file='article.pdf')

Otherwise, you’ll need to manually save a copy of the PDF using Python's I/O functions.

>>> fh = open('article.pdf', 'wb')
>>> fh.write(pdf)
>>> fh.close()

Download bibliographic data

The ref method returns a dictionary of bibliographic data for the document. The information it returns corresponds to whatever citation data is available via JSTOR's export citation function. Usually this includes the journal, publisher, authors, etc. Sometimes it includes the abstract. As before, use the id keyword argument to fill in the relevant document identifier.

>>> ref = conn.ref(id=40506267)
>>> ref['authors']
{'author': ['Ulrike Malmendier', 'Geoffrey Tate']}
>>> ref['abstract']
'Compensation, status, and press coverage of managers in the United States follow a highly skewed distribution: a small number of "superstars" enjoy the bulk of the rewards. We evaluate the impact of CEOs achieving superstar status on the performance of their firms, using prestigious business awards to measure shocks to CEO status. We find that award-winning CEOs subsequently underperform, both relative to their prior performance and relative to a matched sample of non-winning CEOs. At the same time, they extract more compensation following the awards, both in absolute amounts and relative to other top executives in their firms. They also spend more time on public and private activities outside their companies, such as assuming board seats or writing books. The incidence of earnings management increases after winning awards. The effects are strongest in firms with weak corporate governance. Our results suggest that the ex post consequences of media-induced superstar status for shareholders are negative.'

The ref method includes the optional keyword argument affiliation. By default, affiliation is False; when True, ref parses the HTML of the document's JSTOR page for the universities/institutions its authors are affiliated with. The following example illustrates with "Equilibrium Imitation and Growth" (Journal of Political Economy, 2014).

>>> ref = conn.ref(id='10.1086/674362', affiliation=True)
>>> ref['authors']
[{'affiliation': 'University of British Columbia', 'name': 'Jesse Perla'},
{'affiliation': 'Stanford University', 'name': 'Christopher Tonetti'}]

JSTOR has codified authors’ institutions for only a small number of journal articles. If no affiliation is found, just the name of the author is returned.
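
When looping over such mixed results, dict.get avoids a KeyError for authors without an affiliation entry. A small sketch with an illustrative author list:

```python
# Illustrative ref()['authors'] result: the second entry has no affiliation.
authors = [
    {'affiliation': 'University of British Columbia', 'name': 'Jesse Perla'},
    {'name': 'Christopher Tonetti'},
]

# .get returns a default instead of raising KeyError for missing keys.
affiliations = [a.get('affiliation', 'unknown') for a in authors]
print(affiliations)
```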

Download HTML

Finally, the html method returns the raw HTML code from an article's JSTOR page. This might be useful for parsing the entire text of an article (when available) or searching for metadata not returned by ref. Using "Equilibrium Imitation and Growth" again and BeautifulSoup, the following example returns the value of the name attribute for the first meta HTML tag.

>>> from bs4 import BeautifulSoup
>>> html = conn.html(id='10.1086/674362')
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.meta['name']
'robots'

EBSCOhost

Another special Raven class included in Requests-Raven is EBSCOhost. Just like JSTOR, EBSCOhost comes prepackaged with the pdf, ref and html methods. To use it, first establish a connection object.

>>> from requests_raven import EBSCOhost
>>> conn = EBSCOhost(login=deets)

EBSCOhost's version of html, pdf and ref all take the optional db keyword argument, which refers to the "Database Short Name" for a given document. By default, db is set to btn (Business Source Complete); to change it, see the list of other databases supported by EBSCOhost’s API.

Download PDFs

To download a PDF from EBSCOhost's database, you'll need the article's "Accession Number". It's listed at the bottom of an article's bibliographic information; for example, the Accession Number of "The Strategic Bequest Motive" (Journal of Political Economy, 1985) is 5190842.

To grab the PDF, call the pdf method using your connection object conn. Save the file locally by specifying a file name with the file keyword argument.

>>> pdf_contents = conn.pdf(id=5190842, file='article.pdf')

Download bibliographic data

Downloading bibliographic information from EBSCOhost works much as it does with JSTOR: simply call the ref method on the conn connection object.

>>> biblio = conn.ref(id=5190842)
>>> biblio['subject']
['SAVING & investment', 'INHERITANCE & succession', 'ECONOMETRICS', 'ECONOMICS', 'MATHEMATICAL economics', 'POLICYHOLDERS', "SURVIVORS' benefits", 'BENEFICIARIES']

Because authors’ affiliations are automatically returned by EBSCOhost's bibliography export function (when available), EBSCOhost does not include the keyword argument affiliation.

Download HTML

To download the raw HTML of the EBSCOhost webpage associated with an article, call the html method:

>>> html_content = conn.html(id=5190842)
>>> html_content[93:121]
'The Strategic Bequest Motive'

Wiley

The final Raven subclass is Wiley, which accesses data and publications from Wiley Online Library. As with JSTOR and EBSCOhost, it contains three methods: one for downloading PDFs, another for returning bibliographic details and a final method for fetching raw HTML from an article's Wiley webpage.

As before, first establish a connection object.

>>> from requests_raven import Wiley
>>> conn = Wiley(login=deets)

Download PDFs

To download PDFs, you'll need an article's Digital Object Identifier (DOI)—for example, the DOI of the 2003 Econometrica article "Co-operation in Repeated Games When the Number of Stages is Not Commonly Known". Once you have it, call the pdf method using the conn connection object. To save the file locally, specify a file name with the file keyword argument.

>>> pdf_contents = conn.pdf(id='10.1111/1468-0262.00003', file='article.pdf')

Download bibliographic data

Again, call the ref method on the conn connection object to fetch bibliographic information. As with JSTOR, set the affiliation keyword argument to True to return the institutions associated with each author.

>>> biblio = conn.ref(id='10.1111/1468-0262.00003', affiliation=True)
>>> biblio['authors']
[{'name': 'Neyman, Abraham', 'affiliation': 'Institute of Mathematics, The Hebrew University of Jerusalem'}]

Download HTML

To download the raw HTML of the Wiley webpage associated with an article, call the html method:

>>> html_content = conn.html(id='10.1111/1468-0262.00003')
>>> html_content[190:321]
'Cooperation in Repeated Games When the Number of Stages is not Commonly Known - Neyman - 2003 - Econometrica - Wiley Online Library'

Oxford Journals

Requests-Raven also contains the OxfordQJE subclass to access the Quarterly Journal of Economics on Oxford Journals. The methods available are identical to those in JSTOR and Wiley. Because the search function on Oxford Journals is journal specific, OxfordQJE doesn't apply to all of Oxford Journals. However, with slight modification, it should properly fetch PDFs and parse HTML for other journals in their database, as well. Feel free to modify the existing code to do this (and please update the code on GitHub.com!).

License

Copyright 2015 Erin Hengel

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.