By Chris Albon

Last week we published a visualization of the sentiment of crowdsourced reports submitted to Ushahidi deployments. To construct the visualization, we pulled over 4,000 reports from multiple Ushahidi deployments around the world through CrisisNET's API.

In this two-part tutorial, I go step by step through how to replicate the visualization. Part 1 focuses on pulling the data down from CrisisNET and wrangling it with Python. Part 2 focuses on using R to run the sentiment analysis and visualize the results.

The complete code and data for this tutorial are available on GitHub.

What you will need:

- A CrisisNET API key (if you don't have one, you can get one here)
- Python 3 with the requests, pandas, and matplotlib modules (I use iPython Notebook)
- R (for Part 2)

Part 1: Get the data with Python

To pull data down from CrisisNET and do the basic data wrangling, I use Python (specifically: iPython Notebook and Python 3).

Preliminaries

To complete the data wrangling, we are going to 1) pull data from CrisisNET's API using requests, 2) organize it into a dataframe using pandas, and 3) run a basic "eye-ball analysis" on the data using matplotlib. So, first let's load those modules.

# Import required modules
import requests  
import pandas as pd  
import matplotlib.pyplot as plt  

Optionally, if you are replicating this tutorial in iPython Notebook, you'll want to specify that matplotlib display all graphics inline (i.e. in the document itself).

# Make iPython Notebook display graphics 
# inside the page itself 
# (as opposed to opening a new window)
%matplotlib inline

Create an API request function

CrisisNET limits individual API requests to a few hundred observations; this is how we prevent a single user from hogging the system's processing power. However, since we want a few thousand observations, we will need to make multiple API requests and then join the results together.

We can create a quick function in Python to do just that. I've added detailed comments to explain what each line of code does.

# Create a function called get_data that,
# using the api_url, headers, and total variables defined below,
def get_data(offset=0, limit=100, df=None):  
    # creates a request URL,
    url = api_url + '&offset=' + str(offset) + '&limit=' + str(limit)
    # makes an API request,
    r = requests.get(url, headers=headers)
    # converts the request's JSON into a pandas dataframe,
    x = pd.DataFrame(r.json())
    # expands the data field into a dataframe
    x = x['data'].apply(pd.Series)
    # adds it to the bottom of the dataframe called "df"
    df = df.append(x, ignore_index=True)

    # then, if we have not yet pulled down the total number of observations requested
    if total > offset + limit:
        # run the function again
        return get_data(offset + limit, limit, df)

    # if not, spit out the dataframe
    return df

In plain English, the function requests 100 observations from CrisisNET, converts the JSON into a dataframe, joins it to our main dataframe (called "df"), then pulls the next 100 observations. It repeats this process until the total number of observations pulled down matches the total we originally asked for.
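To make the pagination concrete, here is a small standalone sketch (not part of the tutorial's pipeline) that prints the request URLs get_data would step through if we asked for 500 observations in batches of 100. The URL simply mirrors the api_url we define in the next step.

# Standalone illustration of the offsets get_data steps through
# (this just prints the URLs; it makes no actual requests)
example_url = 'http://api.crisis.net/item/?sources=ushahidi'
for offset in range(0, 500, 100):
    print(example_url + '&offset=' + str(offset) + '&limit=100')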

Setup the API request

With the API request function constructed, we can move on to setting up the API request itself. We create three variables containing our API key (which, if you don't have one, you can get here), our API URL, and the request header information.

# Create a variable containing your CrisisNET API key
api_key = 'YOUR_API_KEY_HERE'

# Create a variable with your request filters (i.e. sources=ushahidi)
api_url = 'http://api.crisis.net/item/?sources=ushahidi'

# Create a variable with the request header
headers = {'Authorization': 'Bearer ' + api_key}  

It is important to note that the api_url variable contains both the path to the API endpoint and our request filters. By altering this URL, we can specify the types of data we want to receive from CrisisNET.

In this tutorial we are interested in data from Ushahidi deployments, so we specify sources=ushahidi in the URL. There are many ways to filter CrisisNET API requests, far too many to go into in this blog post. For more details, visit CrisisNET's documentation. Alternatively, you can use our API explorer to quickly build a request URL (although the explorer only covers a few of the possible API filters).
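For example, adding more query parameters to the same URL narrows the request further. The snippet below is only a sketch: the tags and after filters shown are assumptions on my part, so check CrisisNET's documentation for the exact parameter names it supports.

# Sketch of a more specific request URL
# (the 'tags' and 'after' parameters below are illustrative assumptions --
# see CrisisNET's documentation for the supported filters)
base_url = 'http://api.crisis.net/item/?sources=ushahidi'
filtered_url = base_url + '&tags=conflict' + '&after=2014-01-01T00:00:00.000Z'
print(filtered_url)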

Make the API request

Time to make the actual API request! We do this with three lines of code. First, we specify the total number of documents (note: I use documents, reports, and observations interchangeably in this tutorial) we want to receive. Second, we create an empty pandas dataframe to store the requested documents. Finally, we run our get_data function to make the request itself.

# Set the total number of observations requested to 5000
total=5000

# Create the empty dataframe where all the observations will be stored
df = pd.DataFrame()

# Make the API request
df = get_data(df=df)  
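If you want to confirm the request worked before moving on, a quick, optional sanity check is to look at the dataframe's dimensions and its first few rows:

# Optional sanity check: number of rows and columns, plus a peek at the data
print(df.shape)
df.head()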

Data Wrangling

Data wrangling is an informal, catch-all term for setting up the data in the way you want before you run your analysis.

Just to make sure there are no duplicates or observations without any data in them, we run two simple lines of code.

# Count the number of duplicated observations
df['id'].duplicated().value_counts()

# Drop any rows in the dataframe with no data in them
df = df.dropna(how='all')  
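If the duplicate count above turns out to be non-zero, a minimal way to remove the duplicates (keeping the first occurrence of each report id) would be something like:

# Optional: drop duplicate reports, keeping the first occurrence of each id
df = df.drop_duplicates(subset='id')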

Since we received our data in JSON format, we need to "flatten" the data to have it in the right structure for our dataframe. In plain English: JSON data can have many levels, while dataframes contain only one level (i.e. dataframes are flat). The block of code below takes each part of the data that contains nested JSON objects and flattens them into the primary dataframe.
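As a quick aside before the full block, here is a toy example (using made-up data, not the CrisisNET response) of how apply(pd.Series) expands a column of nested dictionaries into flat columns:

# Toy example of "flattening" a column of nested dictionaries
nested = pd.DataFrame({'geo': [{'lat': 1.0, 'lon': 2.0},
                               {'lat': 3.0, 'lon': 4.0}]})
print(nested['geo'].apply(pd.Series))
#    lat  lon
# 0  1.0  2.0
# 1  3.0  4.0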

# Set the row index of the dataframe to be the time the report was updated
df["updatedAt"] = pd.to_datetime(df["updatedAt"])  
df.index = df['updatedAt']

# Expand the geo column into a full dataframe
geo_df = df['geo'].apply(pd.Series)

# Expand the address components column into its own dataframe
geo_admin_df = geo_df['addressComponents'].apply(pd.Series)

# Join the two geo dataframes to the primary dataframe
df = pd.concat([df[:], geo_admin_df[:], geo_df[:]], axis=1)

# Extract the latitude and longitude coordinates into their own columns
df['latitude'] = df['coords'].str[1]  
df['longitude'] = df['coords'].str[0]

# Expand the tags column into its own dataframe
tags_df = df['tags'].apply(pd.Series)

# Keep only the first three tag columns
tags_df = tags_df.iloc[:, 0:3]
tags_df.columns = ['tag1', 'tag2', 'tag3']

# Create a function to extract the tag names
def tag_extractor(x):
    # that, if x is a float (i.e. NaN, meaning there is no tag),
    if type(x) is float:
        # just returns it untouched,
        return x
    # but, if x is a tag, converts it to a dict() and returns the value of its name key,
    elif x:
        x = dict(x)
        return x['name']
    # and returns nothing in any other case
    else:
        return
tags_df = tags_df.applymap(tag_extractor)

# Attach the tags to the main dataframe
df = pd.concat([df[:], tags_df[:]], axis=1)

# Expand the language value:key pair
lang_df = df['language'].apply(pd.Series)

# Attach the language code as a column
df['lang'] = lang_df['code']

# Print the length and view the first row
print(len(df))
df.head(1)

Now that we have a flat dataframe, we can run one line of code to view missing values (represented in green) for each of our variables. The more green we see, the less data that variable contains.

# Visualize missing observations in the dataframe
df.apply(lambda x: x.isnull().value_counts()).T.plot(kind='bar', stacked=True)  

It should look something like this: [stacked bar chart showing, for each column, the count of missing values (green) versus present values]
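If you prefer raw numbers to a chart, an optional alternative is to count the missing values per column directly:

# Optional: count missing values per column as numbers instead of a chart
df.isnull().sum()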

Finally, since our sentiment analysis is limited to English language reports, we create a new dataframe containing only those reports.

# Create df_en, which only includes observations tagged as English
df_en = df[df.lang == 'en']  
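If you are curious how the reports break down by language, and how many we keep by filtering to English, an optional check is:

# Optional: count reports per language code and the size of the English subset
print(df['lang'].value_counts())
print(len(df_en), 'English-language reports kept')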

The final step in Python is saving the dataframe containing our English-language reports to a comma-separated values (CSV) file.

# Save all English-language reports to a CSV file
df_en.to_csv('ushahidi_world_tutorial.csv')  
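If you want to double-check the file before switching to R, you can read it back into pandas:

# Optional: read the saved CSV back in to verify it wrote correctly
pd.read_csv('ushahidi_world_tutorial.csv').head()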

And now we're ready to move onto R!

Part 2 of this tutorial will be out later this week.