By Chris Albon

CrisisNET has hundreds of thousands of crisis-relevant data points ready to be pulled down by users. However, as with most APIs, an individual request is limited to 500 data points so as not to overload our system. How do you get your hands on more than 500 data points? By using the API's offset filter.

The offset filter allows you to skip a specified number of data points. For example, offset=500 would skip the first 500 data points (so the response starts with the 501st). By making multiple pulls, each time using offset to skip over the data points we've already grabbed, we can pull down large amounts of data from CrisisNET.
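
For example, three back-to-back pulls of 100 data points each would use request URLs like these (shown here with the same source filter we use in the tutorial below):

http://api.crisis.net/item/?sources=vdc_syria&offset=0&limit=100      # data points 1-100
http://api.crisis.net/item/?sources=vdc_syria&offset=100&limit=100    # data points 101-200
http://api.crisis.net/item/?sources=vdc_syria&offset=200&limit=100    # data points 201-300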

In the tutorial below, we provide a simple Python script for making these large requests. The full IPython notebook is on GitHub.

Preliminaries

First, we load the required modules. In this case, we are using pandas for data management, requests to make the actual pull, and numpy to deal with missing observations.

# Import modules
import pandas as pd  
import requests  
import numpy as np  

With that done, we are ready to set up the API request. To do this, we will define our API key and the data we wish to request from CrisisNET, set how many data points we want, and create a pandas dataframe where we will store the data.

Note that the request's URL (defined below as the api_url variable) tells CrisisNET what data we want to receive, using the API's filtering system. In this simple example, we are requesting data from the Violation Documentation Center in Syria.

# Insert your CrisisNET API key
api_key = 'YOUR_CRISISNET_API_KEY'

# Insert your CrisisNET request URL
api_url = 'http://api.crisis.net/item/?sources=vdc_syria'

# Create the request header
headers = {'Authorization': 'Bearer ' + api_key}

# Define how many data points you want
total = 3200

# Create a dataframe where the request data will go
df = pd.DataFrame()  

Get the data

With the initial setup complete, we can create a function to make the API requests. I've added comments explaining each line of code, but put simply, the function makes an API request for 100 data points, adds them to the dataframe we created above, and repeats until the number of data points received equals the total number requested.

# Define a function called get_data,
def get_data(offset=0, limit=100, df=None):  
    # create a variable called url, which has the request info,
    url = api_url + '&offset=' + str(offset) + '&limit=' + str(limit)
    # a variable called r, with the request data,
    r = requests.get(url, headers=headers)
    # convert the request data into a dataframe,
    x = pd.DataFrame(r.json())
    # expand the nested 'data' column into separate columns
    x = x['data'].apply(pd.Series)
    # add the dataframe's rows to the main dataframe, df, we defined outside the function
    df = df.append(x, ignore_index=True)

    # then, if the total is larger than the request limit plus offset,
    if total > offset + limit:
        # run the function another time
        return get_data(offset + limit, limit, df)
    # but if not, end the function
    return df
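
If you are running a newer version of pandas, where DataFrame.append has been removed, the append line inside the function can be swapped for the equivalent pd.concat call:

# Row-bind the new batch to the main dataframe with pd.concat
df = pd.concat([df, x], ignore_index=True)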

Let's run the function and get some data!

# Run the function
df = get_data(df=df)  

Check the data

Once we have received the data points, the first thing we want to do is check how many we have. We do that by looking at the length (i.e. the number of rows) of the dataframe.

# Check the number of data points retrieved
len(df)  

Next, we check for duplicate data points and drop them.

# Check for duplicate data points
df['id'].duplicated().value_counts()

# Drop all duplicate data points
df = df.drop_duplicates(subset='id')

As a final check, let's look at the first and last few rows of the data.

# View the first few data points
df.head()

# View the last few data points
df.tail()  

And there we have it. You can either work with the data directly in Python or export the data points as a CSV file for use in another tool.
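
If you want the CSV route, a single line does it (the filename here is just a placeholder):

# Export the data points to a CSV file
df.to_csv('crisisnet_data.csv', index=False)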

Good luck!