By Chris Albon

CrisisNET provides a simple, powerful resource for accessing crisis-relevant data. This is a short introduction to pulling down data from CrisisNET and completing some simple data wrangling tasks using Python's pandas module — setting you up to start pulling and using data on your own.

Note: The example below uses Python 3 and IPython Notebook. The notebook is available at nbviewer.

Initial Setup

First, we load two modules: requests to make the API request, and pandas to do the data wrangling.

# import required modules
import requests  
import pandas as pd  

Optional: I am writing this in IPython Notebook, so I'm also going to set the maximum number of columns displayed to 30. If you are not using IPython, there is no need to run this.

# Set pandas to display a maximum of 30 columns
pd.set_option('display.max_columns', 30)  

Note: If you are not using IPython, you will need to wrap a few lines of code below in a print() function to see their output.

Pulling Data from CrisisNET

Next, create a variable with your CrisisNET API key and set up the information you'll need to make the API request. If you don't have an API key yet, you can get one here.

# Create a variable with your CrisisNET API key
api_key = 'YOUR_API_KEY'

# Setup the request header
headers = {'Authorization': 'Bearer ' + api_key}  

Data requested from CrisisNET can be filtered in a wide range of ways, from location to date period.

In the code below, I've created a quick and dirty method for setting up a few of the available filters. Specifically, I use Python's string formatting to create the API request URL. As written below, I am not filtering on anything; however, if we wanted to filter a request, we would only need to add values after the equals signs. For example, to filter for data tagged as violence, we could change tags= to tags=violence.

The example below only includes a portion of the available ways to filter API requests. For a complete list, visit CrisisNET's documentation.

# Setup the request's URL
url = 'http://devapi.crisis.net/item?%(tags)s&%(before)s&%(after)s&%(text)s&%(location)s&%(radius)s&%(limit)s&%(sources)s&%(licenses)s&%(offset)s'

# Create a dictionary of filters
filters = {
    'tags' : 'tags=',
    'after' : 'after=',
    'before' : 'before=',
    'text' : 'text=',
    'location' : 'location=',
    'radius' : 'radius=',
    'limit' : 'limit=',
    'sources' : 'sources=',
    'licenses' : 'licenses=',
    'offset' : 'offset='}

# Create the formatted request URL
formattedURL = url % filters  
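
As a quick illustration, filling in two of the placeholders produces a filtered request URL (the specific values here are just examples):

# For example, filter for violence-tagged data, capped at 100 items
example_filters = dict(filters, tags='tags=violence', limit='limit=100')
exampleURL = url % example_filters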

With the request URL constructed, we are ready to pull data down from CrisisNET using the requests module.

# Request data from CrisisNET
r = requests.get(formattedURL, headers=headers)

# Check to make sure the pull was successful
# If successful, we will see "<Response [200]>"
print(r)  
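
If the pull fails, requests can surface the problem directly: raise_for_status() raises an HTTPError for any 4xx or 5xx response. A minimal, optional check:

# Optional: raise an HTTPError if the request was unsuccessful
r.raise_for_status()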

Unpacking with pandas

CrisisNET provides data in JSON format. This is great, because it means we can use pandas' built-in functionality to convert it into a dataframe.

# Create a dataframe from the request's json format
request_df = pd.DataFrame(r.json())  

Awesome, right? Let's take a look at the first observation in the data.

# View the first row of the request dataframe
request_df.head(1)  

The output might seem a little strange at this point. The reason is that JSON can have a nested data structure. We have all the data, but we need to do a little more work to convert it into a pretty dataframe.
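
If you want to see the nesting for yourself, peek inside a single cell of the data column; each cell holds an entire dictionary of item attributes:

# Inspect one cell of the data column -- each cell is a nested dictionary
request_df['data'].iloc[0]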

To dig into the JSON structure, we can apply the pandas Series function.

# Create a dataframe from the request's data
df = request_df['data'].apply(pd.Series)  

Let's take a look at what we have done.

# View the first row of the dataframe
df.head(1)  

Amazing, right? We are starting to see the data nested in the JSON format. However, before we dig into the data, we need to check that we have a sufficient number of observations.

# Check the length of the dataframe
len(df)  

With that test passed, we can start to structure our dataframe by indexing the rows of the dataframe by time.

# Set the row index of the dataframe to be the time the report was updated
df["updatedAt"] = pd.to_datetime(df["updatedAt"])  
df.index = df['updatedAt']  
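
One payoff of a datetime index is easy time-based selection. For example, we could sort a copy of the dataframe by time and pull out a single month (the month below is purely illustrative; use one present in your own pull):

# Sort a copy of the dataframe by time
df_sorted = df.sort_index()

# Select every report updated during a single (illustrative) month
df_sorted.loc['2014-06']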

We can now move on to the more important steps of converting the JSON CrisisNET output into a flat dataframe.

First, we will expand the 'geo' JSON object into a pandas dataframe.

# Expand the geo column into a full dataframe
geo_df = df['geo'].apply(pd.Series)  
geo_df.head(1)  

Let's also expand the addressComponents JSON object into a pandas object.

# Expand the address components column into its own dataframe
geo_admin_df = geo_df['addressComponents'].apply(pd.Series)  
geo_admin_df.head(1)  

With both of those sections converted into pandas objects, we can merge them back into the main dataframe.

# Join the two geo dataframes to the primary dataframe
df = pd.concat([df[:], geo_admin_df[:], geo_df[:]], axis=1)  

While not always necessary, sometimes it is useful to have the latitude and longitude coordinates in separate columns. Since the coordinates are stored in [longitude, latitude] order, we just need to select the second and first elements of the df['coords'] column.

# Extract the latitude and longitude coordinates into their own columns
df['latitude'] = df['coords'].str[1]  
df['longitude'] = df['coords'].str[0]  

With the geographic information complete, it is time to move on to the tags section. In addition to each item's original tags, CrisisNET assigns system tags where possible, allowing users to find similar content across different sources. That is a more advanced topic we will address in a future tutorial; for now, we will treat all tags the same.

Just like with the geographic JSON object, we can convert the tags object into a pandas object.

# Expand the tags column into its own pandas object
tags_df = df['tags'].apply(pd.Series)  

Now each tag has its own column. Some items might have many tags while others have only a few. To simplify things, let's take only the first two tags of each item. This is personal preference, of course.

# Keep only the first two tag columns
tags_df = tags_df.iloc[:, 0:2]

# Add titles to the columns
tags_df.columns = ['tag1', 'tag2']  

# View the first few rows of the tags dataframe
tags_df.head(1)  

Getting the tags out of the JSON format requires creating a simple little function. The function will work its way through each cell in the tags dataframe and extract the value of the tag's name key-value pair.

# Create a function called tag_extractor,
def tag_extractor(x):  
    # that, if x is a float (i.e. a missing value),
    if isinstance(x, float):
        # just returns it untouched,
        return x
    # but, if x is not empty,
    elif x:
        # converts x to a dict()
        x = dict(x)
        # and returns the value from the name key,
        return x['name']
    # and returns None for everything else
    else:
        return None

# Apply the function to every cell in the dataframe
tags_df = tags_df.applymap(tag_extractor)  
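
One caveat: recent versions of pandas have deprecated applymap in favor of the equivalent DataFrame.map, so on a current install (pandas 2.1+) the same step looks like this:

# On pandas 2.1+, DataFrame.map replaces applymap
tags_df = tags_df.map(tag_extractor)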

And then we can join the newly created tag dataframe to the main dataframe.

# Join the tags dataframe with the primary dataframe
df = pd.concat([df[:], tags_df[:]], axis=1)  

Next, let's do the same with the language column: expand it, take the language code, and join it back into the main dataframe.

# Expand the language column into its own dataframe and return the language code column to the original dataframe
lang_df = df['language'].apply(pd.Series)  
df['lang'] = lang_df['code']  

This last step is optional, but I find it helpful: dropping the columns we definitely will not be using.

# Drop some extra columns to clean up the dataframe
df = df.drop(['geo', 'updatedAt', 'addressComponents', 'language', 'tags', 'coords', 'id', 'remoteID', ], axis=1)  
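
Because the columns present can vary with the sources you pull from, a slightly safer variant uses drop's errors='ignore' option to skip any column that happens to be missing:

# Drop the same columns, silently skipping any that are not present
df = df.drop(['geo', 'updatedAt', 'addressComponents', 'language',
              'tags', 'coords', 'id', 'remoteID'], axis=1, errors='ignore')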

Finally, our dataframe is ready to use.

# View the first few rows of the final dataframe
df.head()  
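
From here, you might want to save the cleaned dataframe for later analysis; for example (the filename is arbitrary):

# Save the cleaned dataframe to a csv file
df.to_csv('crisisnet_data.csv')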

And that's it! Good luck and have fun.