Formulating batch API GET requests with Python

The first step in many linked open data and digital humanities projects is requesting data from institutions’ APIs. The documentation for APIs can be extensive (the New York Times, for example) or more limited in scope, especially for projects that are not yet complete. While attempting to retrieve data from the new Cooper-Hewitt API for a school project, my teammates and I quickly realized that automated batch downloads were going to be necessary in order to grab the large number of records we were interested in.

Below, I’ve shared the helpful Python lesson and resulting code that we worked through in order to download numerous pages’ worth of requests to a single .json file, as well as how to parse the JSON so that it is human readable.


Assessing the parameters of your request
A number of variables in the following code require an initial inspection of the GET request’s results. The API’s documentation should walk you through many of the possible arguments you can include in your request; however, a few values are vital to the code below. You will need to know the maximum number of results the API will serve you on a single page. For the Brooklyn Museum API, which we will be looking at in this post, that number is 20; in the code it is stored in the page_size variable and passed to the API through the results_limit argument. The other number you will need is the total number of items that your request surfaces. For the Brooklyn Museum’s Textile Collection that is 1027 unique records, displayed at the bottom of the returned JSON as total_items.
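
If you would rather not read through the raw JSON in the browser to find these numbers, a quick way to check them is to make a single request from Python and print the fields you care about. The sketch below assumes total_items sits alongside items inside resultset, as the Brooklyn Museum results described in this post do; adjust the keys to match whatever you see in your own API’s response:

from requests import get

base_url = 'http://www.brooklynmuseum.org/opencollection/api/?method=collection.search&version=1&api_key=uQS0IVg5Oh&keyword=Textile&format=json'

# fetch just the first page and look at the counts the API reports
resultset = get(base_url, verify=False).json()['response']['resultset']

print 'total_items: %s' % resultset['total_items']
print 'records on this page: %d' % len(resultset['items'])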


The Complete Code Snippet

from requests import get
import json

base_url = 'http://www.brooklynmuseum.org/opencollection/api/?method=collection.search&version=1&api_key=uQS0IVg5Oh&keyword=Textile&format=json'

page_size = 20      # maximum number of records the API returns per page
total_items = 1027  # total number of records the API reports for this search

# number of pages needed to cover every record
num_pages = (total_items / page_size) + 1
print 'num_pages: [%d]' % num_pages
results = []

# request each page in turn and collect the 'items' from every response
for i in xrange(0, num_pages):
    full_url = base_url + '&results_limit=%d&start_index=%d' % (page_size, i * page_size)
    print 'fetching page [%d]: %s' % (i, full_url)
    results += get(full_url, verify=False).json()['response']['resultset']['items']

# pretty-print the combined results and write them to a single .json file
json_result = json.dumps(results, sort_keys=True, indent=4, separators=(',', ': '))

f = open('brooklyn_museum_textiles.json', 'w')
f.write(json_result)
f.close()

Calling the appropriate Python libraries

from requests import get
import json

First we need to import the appropriate tools: the get function from the requests library and the json module.
get performs the same GET requests that we could perform manually in the browser.
json is needed because we also want the script to output the returned JSON in a human-readable format.


Defining what we know

base_url = 'http://www.brooklynmuseum.org/opencollection/api/?method=collection.search&version=1&api_key=uQS0IVg5Oh&keyword=Textile&format=json'

page_size = 20
total_items = 1027

Here we are defining a few variables that we already know. The base_url is the GET request that we used earlier in the browser, formulated using the documentation available through the API. This request will return a dataset specified through its arguments; the only issue is that it will only return 20 records at a time.

The page_size variable defines how many records will be requested per page. In this example we have chosen the maximum the API allows, which is 20.

The total_items variable is the total number of items available from the API for this GET request. We know this number because it is stated at the end of the JSON returned by the GET request.


Defining what we don’t know

num_pages = (total_items / page_size) + 1
print 'num_pages: [%d]' % num_pages
results = []

In order to find out how many requests we will need to make, we need to figure out how many pages there are in total. We could use a calculator, but hey, the computer is a giant calculator, so we’ll let it do the work. This also means we can reuse the same snippet for any other GET request in this API, and, with some small modifications, for other APIs.

Using the power of math, we know that num_pages will be total_items divided by the number of records that fit on one page of the request (page_size), plus 1. In Python 2, dividing two integers rounds down, so 1027 / 20 gives 51, which only covers 1020 records; the extra page picks up the remaining 7.
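
If you would rather make the rounding up explicit, math.ceil does the same job; this is just an equivalent alternative to the line above, not part of the original script:

from math import ceil

total_items = 1027
page_size = 20

# round up instead of rounding down and adding 1
num_pages = int(ceil(total_items / float(page_size)))
print 'num_pages: [%d]' % num_pages   # prints 52

One small difference: when total_items happens to be an exact multiple of page_size, the + 1 version requests one extra, empty page, while the ceil version does not.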

The next part of this snippet is a way for us to check how our program is running. Because we are going to ‘dump’ all of our JSON into a new file at the end of this script, we need some feedback in the Terminal; otherwise we’d be left in the dark, not knowing if and when our script crashed. So we ask Python to print a message telling us what num_pages is (the %d placeholder is filled in with the num_pages value).

Finally, we’ve defined results as an empty list, so that we have somewhere to collect the items from each page before writing them out at the end.


Here’s where the computer does all of the work

for i in xrange(0, num_pages):
    full_url = base_url + '&results_limit=%d&start_index=%d' % (page_size, i * page_size)
    print 'fetching page [%d]: %s' % (i, full_url)
    results += get(full_url, verify=False).json()['response']['resultset']['items']

Okay, so now this is where the magic happens. This part of the Python code is a ‘for loop’ (note the indentation after the first line, which begins with for). By using a for loop, we make sure that we only perform our desired action as many times as there are pages of data; otherwise the action might go on forever, or we would have to deal with a bunch of error messages from the API. First we define how many times the action will be performed: the parameters of xrange run from 0 up to (but not including) our num_pages variable.
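
As a quick illustration of how xrange drives the loop (this is not part of the script itself):

for i in xrange(0, 3):
    print i    # prints 0, then 1, then 2; the end value is never reached

So with num_pages at 52, i runs from 0 through 51, one value per page of results.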

Next up is what will be repeated each time the for loop runs. The first thing we want to do is generate a URL for each page that we are requesting; we’ll store it in the variable ‘full_url.’ full_url consists of our base_url (defined above under ‘Defining what we know’) plus a string that we generate. The part of the URL that we need to generate includes two of the API’s query arguments: ‘results_limit=’, which we fill with the page_size variable, and ‘start_index=’, which tells the API which record the requested page should start from.

In order to generate this string we use integer placeholders (%d), which we then fill with the values in % (page_size, i * page_size). The placeholders are filled in order: the first %d takes the first value listed, and the second %d takes the second.
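
For example, on the third pass through the loop (when i is 2), the formatting fills in like this:

suffix = '&results_limit=%d&start_index=%d' % (20, 2 * 20)
print suffix   # &results_limit=20&start_index=40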

Next we formulate a way to keep an eye on the progress of our requests by printing an update that displays the page that is being fetched along with the complete URL (full_url).

In this next line we are making our GET request and also spelling out which sets within the nested JSON we are interested in. We pass full_url to the GET request, and in order to get around any issues with HTTPS we turn off certificate verification (verify=False).

Once we’ve received the data we want the computer to record only the things that are in the ‘items’ set. The JSON that we are downloading is nested, and in order to easily parse it later and visualize it in Open Refine, we only want the nests that include the data we are interested in. In our case that is the ‘items’ set, which is nested in ‘resultset’, which in turn is nested in the ‘response’ set. These key names will change depending on the API, which is why it’s a good idea to formulate an initial GET request in the browser before attempting to create a larger request using Python.
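
To make that nesting concrete, here is a rough sketch of the shape this script assumes the response has. Only response, resultset, items, and total_items come from the discussion above; every other field is an invented placeholder, and the real records hold far more fields:

# a simplified picture of the structure the script expects -- not the API's actual output
example_response = {
    'response': {
        'resultset': {
            'total_items': 1027,
            'items': [
                {'title': 'Example textile record'},     # placeholder record
                {'title': 'Another placeholder record'},
            ],
        },
    },
}

# the same chain of keys the script uses to pull out just the records
items = example_response['response']['resultset']['items']
print len(items)   # 2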


Parsing JSON, all in one line!

json_result = json.dumps(results, sort_keys=True, indent=4, separators=(',', ': '))

In order to parse and record our JSON we’ve created a variable, json_result, to hold the result of json.dumps, which turns our Python list back into a JSON string. sort_keys=True alphabetizes the keys in each record, indent=4 indents each nested level by four spaces, and the separators control the punctuation between elements. This makes our JSON pretty and human readable.
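
A small illustration of what those options do (the record here is made up purely for the example):

import json

record = [{'title': 'Example textile', 'accession_number': '123.4'}]

# keys come out alphabetized and each nested level is indented four spaces
print json.dumps(record, sort_keys=True, indent=4, separators=(',', ': '))

which prints each key on its own line rather than one long, unbroken run of text.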


Cherry on top: Saving the results in a new file

f = open('brooklyn_museum_textiles.json', 'w')
f.write(json_result)
f.close()

We could just print out the results in the Terminal, but since we are requesting a very large amount of data there is a possibility that our results would be truncated, as the Terminal will only keep a certain number of characters in its scrollback. Additionally, we want to plug this data into Open Refine, which requires it to be in a .json file. With this last little snippet we ask Python to open a new file called ‘brooklyn_museum_textiles.json’ in write mode (‘w’); any existing file with this name will be written over.

With ‘f.write(json_result)’ we are asking the computer to write the variable json_result into our new file. Finally, we ask for the file to be closed.
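
As an aside, an equivalent and slightly safer way to do the same thing is Python’s with statement, which closes the file for you even if something goes wrong partway through:

with open('brooklyn_museum_textiles.json', 'w') as f:
    f.write(json_result)
# the file is closed automatically when the block ends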

Now we can go to the desktop and open our JSON file!!
