Requests-Raven
Requests-Raven is a custom Requests class to log onto Raven, the University of Cambridge's central web authentication service. Requests-Raven includes subclasses to remotely access bibliographic information and PDFs stored on JSTOR, EBSCOhost and Wiley.
Installation
Install Requests-Raven with pip (probably as root):
$ pip install requests_raven
To install from source, download the latest version on GitHub.com and run the following command:
$ python setup.py install
If you are on a Mac, Python 2.7 is pre-installed by default; upgrade at python.org and follow the instructions in the Installation section of Textatistic to update the python command link.
Quickstart
The Raven class logs onto Raven and establishes a connection with the host. The session attribute returns a Requests Session object with all the methods of the main Requests API. For example, to establish a Raven connection object for the website qje.oxfordjournals.org:
>>> from requests_raven import Raven
>>> deets = {'userid': 'ab123', 'pwd': 'XXXX'}
>>> conn = Raven(url='http://qje.oxfordjournals.org', login=deets)
Then use session to access Requests methods.
>>> url = '{}/content/130/4/1623.full'.format(conn.url)
>>> request = conn.session.get(url)
>>> request.status_code
200
request.text contains the HTML code of the page you requested—which you can parse using, e.g., the Python module Beautiful Soup.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(request.text, 'html.parser')
>>> soup.title
<title>Behavioral Hazard in Health Insurance </title>
Accessing JSTOR, EBSCOhost and Wiley
JSTOR, EBSCOhost and Wiley are Raven subclasses specifically tailored to log onto jstor.org, ebscohost.com and onlinelibrary.wiley.com, respectively. Each subclass includes three methods: html fetches the HTML text, pdf downloads the PDF and ref returns bibliographic information for a given document identifier. For example, to establish a Raven connection object to JSTOR's database:
>>> from requests_raven import JSTOR
>>> conn = JSTOR(login=deets)
Fetch the HTML of the webpage devoted to the article identified by 10.1086/682574 in JSTOR's database:
>>> doc_id = '10.1086/682574'
>>> html = conn.html(id=doc_id)
>>> html[239:400]
'Per Krusell, Anthony A. Smith Jr., Is Piketty’s “Second Law of Capitalism” Fundamental?, Journal of Political Economy, Vol. 123, No. 4 (August 2015), pp. 725-748'
Download the document's PDF and bibliographic information.
>>> pdf = conn.pdf(id=doc_id, file='article.pdf')
>>> biblio = conn.ref(id=doc_id)
>>> biblio['authors']
[{'name': 'Per Krusell'}, {'name': 'Anthony A. Smith'}]
Raven
Establishing a connection to the host URL is very simple. For example, use the Raven class to access the Quarterly Journal of Economics on Oxford Journals. First, import Raven from the requests_raven module.
>>> from requests_raven import Raven
Next, establish a connection object, using the keyword arguments url to point to the restricted site—qje.oxfordjournals.org—and login to denote a dictionary containing userid and pwd: userid refers to your CRSid and pwd to your Raven password. Python prompts you if either is omitted.
>>> deets = {'userid': 'ab123', 'pwd': 'XXXX'}
>>> conn = Raven(url='http://qje.oxfordjournals.org', login=deets)
conn contains two attributes. The first, url, is the host's URL. Except it won't look anything like the URL you originally supplied; it should look something like the following:
>>> conn.url
'http://libsta28.lib.cam.ac.uk:2924'
This is an EZproxy server. It gives you remote access to restricted websites by making it look like you're using a university library computer. Note that the server you're given may look very different to this one, and the port number especially is unlikely to be the same. From now on, always use conn.url instead of referring to the original URL.
Requests Session object
The second attribute in conn is session, which references the original Requests Session object created by Raven—and therefore all of its methods. For example, a GET request of the webpage for "Behavioral Hazard in Health Insurance" returns a Response object whose status_code attribute reports the HTTP response code.
>>> proxy_url = '{}/content/130/4/1623.full'.format(conn.url)
>>> request = conn.session.get(proxy_url)
>>> request.status_code
200
If you know the DOI of an article, you can pass it as a parameter to the Oxford Journals search function.
>>> proxy_url = '{}/search'.format(conn.url)
>>> payload = {'submit': 'yes', 'doi': '10.1093/qje/qjv029'}
>>> request = conn.session.get(proxy_url, params=payload)
request contains a list of possible matches—but given you've specified a (unique) DOI, the list should contain only one item. To find a link to the article's PDF, parse the HTML using the Requests text attribute and BeautifulSoup:
>>> soup = BeautifulSoup(request.text, 'html.parser')
>>> rel_link = soup.find(attrs={'rel': 'full-text.pdf'})['href']
>>> rel_link
'/content/130/4/1623.full.pdf+html?sid=059ca79d-20f2-45d5-b5ad-d8789e235110'
Sometimes rel_link contains an absolute link, but that's not the case here, so we'll have to build the link ourselves.
>>> link = conn.url + rel_link
>>> link
'http://libsta28.lib.cam.ac.uk:2314/content/130/4/1623.full.pdf+html?sid=059ca79d-20f2-45d5-b5ad-d8789e235110'
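Incidentally, urljoin from Python's standard library handles both the relative and the absolute case, so you don't have to check which kind of link you got. A minimal sketch, with a hard-coded base standing in for conn.url:

```python
from urllib.parse import urljoin

# Hypothetical EZproxy base, standing in for conn.url:
base = 'http://libsta28.lib.cam.ac.uk:2924'

# A relative link is joined onto the base...
full = urljoin(base, '/content/130/4/1623.full.pdf+html')

# ...while an absolute link is returned unchanged.
same = urljoin(base, 'http://www.example.org/article.pdf')
```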
This link actually directs you to a split screen: the PDF on the left and metadata and other content related to the article on the right. To get just the PDF, replace pdf+html in the URL with pdf and use session to make another GET request.
>>> pdf_link = link.replace('pdf+html', 'pdf')
>>> pdf_request = conn.session.get(pdf_link)
Finally, to download the PDF to your local disk, save the content in pdf_request as a binary file.
>>> fh = open('article.pdf', 'wb')
>>> fh.write(pdf_request.content)
>>> fh.close()
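The same save can be written with a context manager, which closes the file automatically even if the write fails. The byte string below is a stand-in for pdf_request.content:

```python
# Stand-in for pdf_request.content:
pdf_bytes = b'%PDF-1.4 example content'

# The with statement closes the file handle automatically.
with open('article.pdf', 'wb') as fh:
    fh.write(pdf_bytes)
```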
You can get a lot of article metadata this way. Peruse each website's HTML to figure out exactly how to extract the information you need.
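As a minimal sketch of that kind of scraping (standard library only; the HTML snippet is made up for illustration), you might collect the name/content pairs of every meta tag on a page:

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Gather name/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            d = dict(attrs)
            if 'name' in d and 'content' in d:
                self.meta[d['name']] = d['content']

# Made-up page, standing in for request.text:
page = ('<html><head>'
        '<meta name="citation_title" content="Behavioral Hazard in Health Insurance">'
        '<meta name="citation_year" content="2015">'
        '</head></html>')

parser = MetaCollector()
parser.feed(page)
parser.meta['citation_title']  # 'Behavioral Hazard in Health Insurance'
```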
JSTOR
JSTOR is a Raven subclass that connects to JSTOR's database. It comes prepackaged with several methods for downloading article PDFs and accessing bibliographic data.
To use any of the JSTOR-specific methods, first establish a connection object using JSTOR:
>>> from requests_raven import JSTOR
>>> conn = JSTOR(login=deets)
Download PDFs
To download the PDF of a particular article, you'll need the article's JSTOR id—i.e., everything after the word "stable" in its "stable URL". The stable URL is generally found on the first page of a PDF downloaded from JSTOR or at the top of the webpage dedicated to it. For example, the stable URL of the article "Superstar CEOs" in the November 2009 issue of the Quarterly Journal of Economics is http://www.jstor.org/stable/40506267, so the document id is 40506267.
To fetch the PDF, use the pdf method and indicate the JSTOR document identifier with the id keyword argument. If you supply a file name using the file keyword argument, the PDF is automatically saved for you.
>>> pdf = conn.pdf(id=40506267, file='article.pdf')
Otherwise, you’ll need to manually save a copy of the PDF using Python's I/O functions.
>>> fh = open('article.pdf', 'wb')
>>> fh.write(pdf)
>>> fh.close()
Download bibliographic data
The ref method returns a dictionary of bibliographic data for the document. The information it returns corresponds to whatever citation data is available via JSTOR's export citation function. Usually this includes the journal, publisher, authors, etc.; sometimes it includes the abstract. As before, use the id keyword argument to fill in the relevant document identifier.
>>> ref = conn.ref(id=40506267)
>>> ref['authors']
{'author': ['Ulrike Malmendier', 'Geoffrey Tate']}
>>> ref['abstract']
'Compensation, status, and press coverage of managers in the United States follow a highly skewed distribution: a small number of "superstars" enjoy the bulk of the rewards. We evaluate the impact of CEOs achieving superstar status on the performance of their firms, using prestigious business awards to measure shocks to CEO status. We find that award-winning CEOs subsequently underperform, both relative to their prior performance and relative to a matched sample of non-winning CEOs. At the same time, they extract more compensation following the awards, both in absolute amounts and relative to other top executives in their firms. They also spend more time on public and private activities outside their companies, such as assuming board seats or writing books. The incidence of earnings management increases after winning awards. The effects are strongest in firms with weak corporate governance. Our results suggest that the ex post consequences of media-induced superstar status for shareholders are negative.'
The ref method includes the optional keyword argument affiliation. By default, affiliation is False; when True, ref parses the HTML on the document's JSTOR site for the universities/institutions its authors are affiliated with. The following example illustrates with "Equilibrium Imitation and Growth" (Journal of Political Economy, 2014).
>>> ref = conn.ref(id='10.1086/674362', affiliation=True)
>>> ref['authors']
[{'affiliation': 'University of British Columbia', 'name': 'Jesse Perla'},
{'affiliation': 'Stanford University', 'name': 'Christopher Tonetti'}]
JSTOR has codified authors’ institutions for only a small number of journal articles. If no affiliation is found, just the name of the author is returned.
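Because the affiliation key can therefore be missing, code that consumes these author lists should treat it as optional. A hypothetical formatting helper (not part of Requests-Raven; the author list mirrors the example above):

```python
def author_lines(authors):
    """Format author dicts, with or without an 'affiliation' key."""
    lines = []
    for author in authors:
        if 'affiliation' in author:
            lines.append('{} ({})'.format(author['name'], author['affiliation']))
        else:
            lines.append(author['name'])
    return lines

authors = [
    {'name': 'Jesse Perla', 'affiliation': 'University of British Columbia'},
    {'name': 'Christopher Tonetti'},  # no affiliation codified by JSTOR
]
author_lines(authors)
# ['Jesse Perla (University of British Columbia)', 'Christopher Tonetti']
```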
Download HTML
Finally, the html method returns the raw HTML code from an article's JSTOR page. This might be useful for parsing the entire text of an article (when available) or searching for metadata not returned by ref. Using "Equilibrium Imitation and Growth" again with BeautifulSoup, the following example returns the value of the name attribute of the first meta HTML tag.
>>> from bs4 import BeautifulSoup
>>> html = conn.html(id='10.1086/674362')
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.meta['name']
'robots'
EBSCOhost
Another Raven subclass included in Requests-Raven is EBSCOhost. Just like JSTOR, EBSCOhost is prepackaged with the pdf, ref and html methods. To use it, first establish a connection object.
>>> from requests_raven import EBSCOhost
>>> conn = EBSCOhost(login=deets)
EBSCOhost's versions of pdf, ref and html all take the optional db keyword argument, which refers to the "Database Short Name" for a given document. By default, db is set to btn (Business Source Complete); to change it, see the list of other databases supported by EBSCOhost's API.
Download PDFs
To download a PDF from EBSCOhost's database, you'll need what's called an "Accession Number". It's listed at the bottom of an article's bibliographic information—for the example used below, "The Strategic Bequest Motive" (Journal of Political Economy, 1985), it is 5190842.
To grab the PDF, call the pdf method using your connection object conn. Save the file locally by specifying a file name with the file keyword argument.
>>> pdf_contents = conn.pdf(id=5190842, file='article.pdf')
Download bibliographic data
Downloading bibliographic information from EBSCOhost is done in a manner very similar to that used with JSTOR. Simply call the ref method on the conn connection object.
>>> biblio = conn.ref(id=5190842)
>>> biblio['subject']
['SAVING & investment', 'INHERITANCE & succession', 'ECONOMETRICS', 'ECONOMICS', 'MATHEMATICAL economics', 'POLICYHOLDERS', "SURVIVORS' benefits", 'BENEFICIARIES']
Because authors' affiliations are automatically returned by EBSCOhost's bibliography export function (when available), EBSCOhost does not include the keyword argument affiliation.
Download HTML
To download the raw HTML of an EBSCOhost web page associated with an article, call the html method:
>>> html_content = conn.html(id=5190842)
>>> html_content[93:121]
'The Strategic Bequest Motive'
Wiley
The final Raven subclass is Wiley, which accesses data and publications from Wiley Online Library. As with JSTOR and EBSCOhost, it contains three methods: one for downloading PDFs, another for returning bibliographic details and a final method for fetching raw HTML from an article's Wiley webpage.
As before, first establish a connection object.
>>> from requests_raven import Wiley
>>> conn = Wiley(login=deets)
Download PDFs
To download PDFs, you'll need an article's Digital Object Identifier (DOI)—for example, 10.1111/1468-0262.00003 for the 2003 Econometrica article "Co-operation in Repeated Games When the Number of Stages is Not Commonly Known". Once you have it, call the pdf method using the conn connection object. To save it locally, specify a file name.
>>> pdf_contents = conn.pdf(id='10.1111/1468-0262.00003', file='article.pdf')
Download bibliographic data
Again, call the ref method on the conn connection object to fetch bibliographic information. As with JSTOR, set the affiliation keyword argument to True to return the institutions associated with each author.
>>> biblio = conn.ref(id='10.1111/1468-0262.00003', affiliation=True)
>>> biblio['authors']
[{'name': 'Neyman, Abraham', 'affiliation': 'Institute of Mathematics, The Hebrew University of Jerusalem'}]
Download HTML
To download the raw HTML of a web page associated with an article, call the html method:
>>> html_content = conn.html(id='10.1111/1468-0262.00003')
>>> html_content[190:321]
'Cooperation in Repeated Games When the Number of Stages is not Commonly Known - Neyman - 2003 - Econometrica - Wiley Online Library'
Oxford Journals
Requests-Raven also contains the OxfordQJE subclass to access the Quarterly Journal of Economics on Oxford Journals. The methods available are identical to those in JSTOR and Wiley. Because the search function on Oxford Journals is journal specific, OxfordQJE doesn't apply to all of Oxford Journals. However, with slight modification, it should properly fetch PDFs and parse HTML for other journals in their database as well. Feel free to modify the existing code to do this (and please update the code on GitHub.com!).
License
Copyright 2015 Erin Hengel
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.