Analyzing Police Activity with Pandas
Exploring the Stanford Open Policing Project dataset and analyzing the impact of gender on police behavior.
- Preparing the data for analysis
- Exploring the relationship between gender and policing
- Visual exploratory data analysis
- Analyzing the effect of weather on policing
Preparing the data for analysis
Before beginning our analysis, it is critical that we first examine and clean the dataset, to make working with it a more efficient process. We will fixing data types, handle missing values, and dropping columns and rows while exploring the Stanford Open Policing Project dataset.
Stanford Open Policing Project dataset
Examining the dataset
We'll be analyzing a dataset of traffic stops in Rhode Island that was collected by the Stanford Open Policing Project.
Before beginning our analysis, it's important that we familiarize yourself with the dataset. We read the dataset into pandas, examine the first few rows, and then count the number of missing values.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.api.types import CategoricalDtype
# Read 'police.csv' into a DataFrame named ri
ri = pd.read_csv("../datasets/police.csv")
# Examine the head of the DataFrame
display(ri.head())
# Count the number of missing values in each column
ri.isnull().sum()
It looks like most of the columns have at least some missing values. We'll figure out how to handle these values in the next.
Dropping columns
We'll drop the county_name
column because it only contains missing values, and we'll drop the state
column because all of the traffic stops took place in one state (Rhode Island).
# Examine the shape of the DataFrame
print(ri.shape)
# Drop the 'county_name' and 'state' columns
ri.drop(["county_name", "state"], axis='columns', inplace=True)
# Examine the shape of the DataFrame (again)
print(ri.shape)
# Count the number of missing values in each column
display(ri.isnull().sum())
# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=["driver_gender"], inplace=True)
# Count the number of missing values in each column (again)
display(ri.isnull().sum())
# Examine the shape of the DataFrame
ri.shape
ri.dtypes
-
stop_date
: should be datetime -
stop_time
: should be datetime -
driver_gender
: should be category -
driver_race
: should be category -
violation_raw
: should be category -
violation
: should be category -
district
: should be category -
is_arrested
: should be bool
We'll fix the data type of the is_arrested
column
# Examine the head of the 'is_arrested' column
display(ri.is_arrested.head())
# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')
# Check the data type of 'is_arrested'
ri.is_arrested.dtype
ri['stop_date_time'] = pd.to_datetime(ri.stop_date.str.replace("/", "-").str.cat(ri.stop_time, sep=" "))
ri.dtypes
# Set 'stop_datetime' as the index
ri.set_index("stop_date_time", inplace=True)
# Examine the index
display(ri.index)
# Examine the columns
ri.columns
Exploring the relationship between gender and policing
Does the gender of a driver have an impact on police behavior during a traffic stop? We will explore that question while doing filtering, grouping, method chaining, Boolean math, string methods, and more!
Do the genders commit different violations?
Examining traffic violations
Before comparing the violations being committed by each gender, we should examine the violations committed by all drivers to get a baseline understanding of the data.
We'll count the unique values in the violation
column, and then separately express those counts as proportions.
# Count the unique values in 'violation'
display(ri.violation.value_counts())
# Express the counts as proportions
ri.violation.value_counts(normalize=True)
More than half of all violations are for speeding, followed by other moving violations and equipment violations.
Comparing violations by gender
The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.
We'll first create a DataFrame for each gender, and then analyze the violations in each DataFrame separately.
# Create a DataFrame of female drivers
female = ri[ri.driver_gender=="F"]
# Create a DataFrame of male drivers
male = ri[ri.driver_gender=="M"]
# Compute the violations by female drivers (as proportions)
display(female.violation.value_counts(normalize=True))
# Compute the violations by male drivers (as proportions)
male.violation.value_counts(normalize=True)
About two-thirds of female traffic stops are for speeding, whereas stops of males are more balanced among the six categories. This doesn't mean that females speed more often than males, however, since we didn't take into account the number of stops or drivers.
Does gender affect who gets a ticket for speeding?
Comparing speeding outcomes by gender
When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can we find evidence of this in the dataset?
First, we'll create two DataFrames of drivers who were stopped for speeding: one containing females and the other containing males.
Then, for each gender, we'll use the stop_outcome
column to calculate what percentage of stops resulted in a "Citation" (meaning a ticket) versus a "Warning".
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender=="F") & (ri.violation =="Speeding")]
# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender=="M") & (ri.violation =="Speeding")]
# Compute the stop outcomes for female drivers (as proportions)
display(female_and_speeding.stop_outcome.value_counts(normalize=True))
# Compute the stop outcomes for male drivers (as proportions)
male_and_speeding.stop_outcome.value_counts(normalize=True)
The numbers are similar for males and females: about 95% of stops for speeding result in a ticket. Thus, the data fails to show that gender has an impact on who gets a ticket for speeding.
Does gender affect whose vehicle is searched?
Calculating the search rate
During a traffic stop, the police officer sometimes conducts a search of the vehicle. We'll calculate the percentage of all stops in the ri
DataFrame that result in a vehicle search, also known as the search rate.
# Check the data type of 'search_conducted'
print(ri.search_conducted.dtype)
# Calculate the search rate by counting the values
display(ri.search_conducted.value_counts(normalize=True))
# Calculate the search rate by taking the mean
ri.search_conducted.mean()
It looks like the search rate is about 3.8%. Next, we'll examine whether the search rate varies by driver gender.
Comparing search rates by gender
We'll compare the rates at which female and male drivers are searched during a traffic stop. Remember that the vehicle search rate across all stops is about 3.8%.
First, we'll filter the DataFrame by gender and calculate the search rate for each group separately. Then, we'll perform the same calculation for both genders at once using a .groupby()
.
ri[ri.driver_gender=="F"].search_conducted.mean()
ri[ri.driver_gender=="M"].search_conducted.mean()
ri.groupby("driver_gender").search_conducted.mean()
Male drivers are searched more than twice as often as female drivers. Why might this be?
Adding a second factor to the analysis
Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.
For example, we might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.
we can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis was true, out would find that males and females are searched at about the same rate for each violation. Let's find out below if that's the case!
# Calculate the search rate for each combination of gender and violation
ri.groupby(["driver_gender", "violation"]).search_conducted.mean()
ri.groupby(["violation", "driver_gender"]).search_conducted.mean()
For all types of violations, the search rate is higher for males than for females, disproving our hypothesis.
Does gender affect who is frisked during a search?
Counting protective frisks
During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a "protective frisk."
We'll first check to see how many times "Protective Frisk" was the only search type. Then, we'll use a string method to locate all instances in which the driver was frisked.
# Count the 'search_type' values
display(ri.search_type.value_counts())
# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)
# Check the data type of 'frisk'
print(ri.frisk.dtype)
# Take the sum of 'frisk'
print(ri.frisk.sum())
It looks like there were 303 drivers who were frisked. Next, we'll examine whether gender affects who is frisked.
Comparing frisk rates by gender
We'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?
Before doing any calculations, it's important to filter the DataFrame to only include the relevant subset of data, namely stops in which a search was conducted.
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted == True]
# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched.frisk.mean())
# Calculate the frisk rate for each gender
searched.groupby("driver_gender").frisk.mean()
The frisk rate is higher for males than for females, though we can't conclude that this difference is caused by the driver's gender.
Visual exploratory data analysis
Are you more likely to get arrested at a certain time of day? Are drug-related stops on the rise? We will answer these and other questions by analyzing the dataset visually, since plots can help us to understand trends in a way that examining the raw data cannot.
Does time of the day affect arrest rate?
Calculating the hourly arrest rate
When a police officer stops a driver, a small percentage of those stops ends in an arrest. This is known as the arrest rate. We'll find out whether the arrest rate varies by time of day.
First, we'll calculate the arrest rate across all stops in the ri
DataFrame. Then, we'll calculate the hourly arrest rate by using the hour attribute of the index. The hour ranges from 0 to 23, in which:
- 0 = midnight
- 12 = noon
- 23 = 11 PM
# Calculate the overall arrest rate
print(ri.is_arrested.mean())
# Calculate the hourly arrest rate
# Save the hourly arrest rate
hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean()
hourly_arrest_rate
Next we'll plot the data so that you can visually examine the arrest rate trends.
Plotting the hourly arrest rate
We'll create a line plot from the hourly_arrest_rate
object.
# Create a line plot of 'hourly_arrest_rate'
hourly_arrest_rate.plot()
# Add the xlabel, ylabel, and title
plt.xlabel("Hour")
plt.ylabel("Arrest Rate")
plt.title("Arrest Rate by Time of Day")
# Display the plot
plt.show()
The arrest rate has a significant spike overnight, and then dips in the early morning hours.
Are drug-related stops on the rise?
Plotting drug-related stops
In a small portion of traffic stops, drugs are found in the vehicle during a search. In this exercise, you'll assess whether these drug-related stops are becoming more common over time.
The Boolean column drugs_related_stop
indicates whether drugs were found during a given stop. We'll calculate the annual drug rate by resampling this column, and then we'll use a line plot to visualize how the rate has changed over time.
# Calculate the annual rate of drug-related stops
# Save the annual rate of drug-related stops
annual_drug_rate = ri.drugs_related_stop.resample("A").mean()
display(annual_drug_rate)
# Create a line plot of 'annual_drug_rate'
annual_drug_rate.plot()
# Display the plot
plt.show()
The rate of drug-related stops nearly doubled over the course of 10 years. Why might that be the case?
Comparing drug and search rates
The rate of drug-related stops increased significantly between 2005 and 2015. We might hypothesize that the rate of vehicle searches was also increasing, which would have led to an increase in drug-related stops even if more drivers were not carrying drugs.
We can test this hypothesis by calculating the annual search rate, and then plotting it against the annual drug rate. If the hypothesis is true, then we'll see both rates increasing over time.
# Calculate and save the annual search rate
annual_search_rate = ri.search_conducted.resample("A").mean()
# Concatenate 'annual_drug_rate' and 'annual_search_rate'
annual = pd.concat([annual_drug_rate, annual_search_rate], axis="columns")
# Create subplots from 'annual'
annual.plot(subplots=True)
# Display the subplots
plt.show()
The rate of drug-related stops increased even though the search rate decreased, disproving our hypothesis.
What violations are caught in each district?
Tallying violations by district
The state of Rhode Island is broken into six police districts, also known as zones. How do the zones compare in terms of what violations are caught by police?
We'll create a frequency table to determine how many violations of each type took place in each of the six zones. Then, we'll filter the table to focus on the "K" zones, which we'll examine further.
# Create a frequency table of districts and violations
# Save the frequency table as 'all_zones'
all_zones = pd.crosstab(ri.district, ri.violation)
display(all_zones)
# Select rows 'Zone K1' through 'Zone K3'
# Save the smaller table as 'k_zones'
k_zones = all_zones.loc["Zone K1":"Zone K3"]
k_zones
We'll plot the violations so that you can compare these districts.
Plotting violations by district
Now that we've created a frequency table focused on the "K" zones, we'll visualize the data to help us compare what violations are being caught in each zone.
First we'll create a bar plot, which is an appropriate plot type since we're comparing categorical data. Then we'll create a stacked bar plot in order to get a slightly different look at the data.
# Create a bar plot of 'k_zones'
k_zones.plot(kind="bar")
# Display the plot
plt.show()
# Create a stacked bar plot of 'k_zones'
k_zones.plot(kind="bar", stacked=True)
# Display the plot
plt.show()
The vast majority of traffic stops in Zone K1 are for speeding, and Zones K2 and K3 are remarkably similar to one another in terms of violations.
How long might you be stopped for a violation?
Converting stop durations to numbers
In the traffic stops dataset, the stop_duration
column tells us approximately how long the driver was detained by the officer. Unfortunately, the durations are stored as strings, such as '0-15 Min'
. How can we make this data easier to analyze?
We'll convert the stop durations to integers. Because the precise durations are not available, we'll have to estimate the numbers using reasonable values:
- Convert
'0-15 Min'
to8
- Convert
'16-30 Min'
to23
- Convert
'30+ Min'
to45
# Create a dictionary that maps strings to integers
mapping = {"0-15 Min":8, '16-30 Min':23, '30+ Min':45}
# Convert the 'stop_duration' strings to integers using the 'mapping'
ri['stop_minutes'] = ri.stop_duration.map(mapping)
# Print the unique values in 'stop_minutes'
ri.stop_minutes.unique()
Next we'll analyze the stop length for each type of violation.
Plotting stop length
If you were stopped for a particular violation, how long might you expect to be detained?
We'll visualize the average length of time drivers are stopped for each type of violation. Rather than using the violation
column we'll use violation_raw
since it contains more detailed descriptions of the violations.
# Calculate the mean 'stop_minutes' for each value in 'violation_raw'
# Save the resulting Series as 'stop_length'
stop_length = ri.groupby("violation_raw").stop_minutes.mean()
display(stop_length)
# Sort 'stop_length' by its values and create a horizontal bar plot
stop_length.sort_values().plot(kind="barh")
# Display the plot
plt.show()
Analyzing the effect of weather on policing
We will use a second dataset to explore the impact of weather conditions on police behavior during traffic stops. We will be merging and reshaping datasets, assessing whether a data source is trustworthy, working with categorical data, and other advanced skills.
Exploring the weather dataset
Plotting the temperature
We'll examine the temperature columns from the weather dataset to assess whether the data seems trustworthy. First we'll print the summary statistics, and then you'll visualize the data using a box plot.
# Read 'weather.csv' into a DataFrame named 'weather'
weather = pd.read_csv("../datasets/weather.csv")
display(weather.head())
# Describe the temperature columns
display(weather[["TMIN", "TAVG", "TMAX"]].describe().T)
# Create a box plot of the temperature columns
weather[["TMIN", "TAVG", "TMAX"]].plot(kind='box')
# Display the plot
plt.show()
The temperature data looks good so far: the TAVG
values are in between TMIN
and TMAX
, and the measurements and ranges seem reasonable.
Plotting the temperature difference
We'll continue to assess whether the dataset seems trustworthy by plotting the difference between the maximum and minimum temperatures.
# Create a 'TDIFF' column that represents temperature difference
weather["TDIFF"] = weather.TMAX - weather.TMIN
# Describe the 'TDIFF' column
display(weather.TDIFF.describe())
# Create a histogram with 20 bins to visualize 'TDIFF'
weather.TDIFF.plot(kind="hist", bins=20)
# Display the plot
plt.show()
The TDIFF
column has no negative values and its distribution is approximately normal, both of which are signs that the data is trustworthy.
Categorizing the weather
Counting bad weather conditions
The weather
DataFrame contains 20 columns that start with 'WT'
, each of which represents a bad weather condition. For example:
-
WT05
indicates "Hail" -
WT11
indicates "High or damaging winds" -
WT17
indicates "Freezing rain"
For every row in the dataset, each WT
column contains either a 1
(meaning the condition was present that day) or NaN
(meaning the condition was not present).
We'll quantify "how bad" the weather was each day by counting the number of 1 values in each row.
# Copy 'WT01' through 'WT22' to a new DataFrame
WT = weather.loc[:, "WT01":"WT22"]
# Calculate the sum of each row in 'WT'
weather['bad_conditions'] = WT.sum(axis="columns")
# Replace missing values in 'bad_conditions' with '0'
weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int')
# Create a histogram to visualize 'bad_conditions'
weather.bad_conditions.plot(kind="hist")
# Display the plot
plt.show()
It looks like many days didn't have any bad weather conditions, and only a small portion of days had more than four bad weather conditions.
Rating the weather conditions
We counted the number of bad weather conditions each day. We'll use the counts to create a rating system for the weather.
The counts range from 0 to 9, and should be converted to ratings as follows:
- Convert 0 to 'good'
- Convert 1 through 4 to 'bad'
- Convert 5 through 9 to 'worse'
# Count the unique values in 'bad_conditions' and sort the index
display(weather.bad_conditions.value_counts().sort_index())
# Create a dictionary that maps integers to strings
mapping = {0:'good', 1:'bad', 2:'bad', 3:'bad', 4:'bad', 5:'worse', 6:'worse', 7:'worse', 8:'worse', 9:'worse'}
# Convert the 'bad_conditions' integers to strings using the 'mapping'
weather['rating'] = weather.bad_conditions.map(mapping)
# Count the unique values in 'rating'
weather.rating.value_counts()
# Create a list of weather ratings in logical order
cats = ['good', 'bad', 'worse']
# Change the data type of 'rating' to category
weather['rating'] = weather.rating.astype(CategoricalDtype(ordered=True, categories=cats))
# Examine the head of 'rating'
weather.rating.head()
We'll use the rating column in future exercises to analyze the effects of weather on police behavior.
Merging datasets
Preparing the DataFrames
We'll prepare the traffic stop and weather rating DataFrames so that they're ready to be merged:
- With the
ri
DataFrame, we'll move thestop_datetime
index to a column since the index will be lost during the merge. - With the weather DataFrame, we'll select the
DATE
andrating
columns and put them in a new DataFrame.
# Reset the index of 'ri'
ri.reset_index(inplace=True)
# Examine the head of 'ri'
display(ri.head())
# Create a DataFrame from the 'DATE' and 'rating' columns
weather_rating = weather[["DATE", "rating"]]
# Examine the head of 'weather_rating'
weather_rating.head()
The ri
and weather_rating
DataFrames are now ready to be merged.
Merging the DataFrames
We'll merge the ri
and weather_rating
DataFrames into a new DataFrame, ri_weather
.
The DataFrames will be joined using the stop_date
column from ri
and the DATE
column from weather_rating
. Thankfully the date formatting matches exactly, which is not always the case!
Once the merge is complete, we'll set stop_datetime
as the index
# Examine the shape of 'ri'
print(ri.shape)
# Merge 'ri' and 'weather_rating' using a left join
ri_weather = pd.merge(left=ri, right=weather_rating, left_on='stop_date', right_on='DATE', how='left')
# Examine the shape of 'ri_weather'
print(ri_weather.shape)
# Set 'stop_datetime' as the index of 'ri_weather'
ri_weather.set_index('stop_date_time', inplace=True)
ri_weather.head()
We'll use ri_weather to analyze the relationship between weather conditions and police behavior.
Does weather affect the arrest rate?
Comparing arrest rates by weather rating
Do police officers arrest drivers more often when the weather is bad? Let's find out below!
- First, we'll calculate the overall arrest rate.
- Then, we'll calculate the arrest rate for each of the weather ratings we previously assigned.
- Finally, we'll add violation type as a second factor in the analysis, to see if that accounts for any differences in the arrest rate.
Since we previously defined a logical order for the weather categories, good < bad < worse, they will be sorted that way in the results.
# Calculate the overall arrest rate
print(ri_weather.is_arrested.mean())
# Calculate the arrest rate for each 'rating'
ri_weather.groupby("rating").is_arrested.mean()
# Calculate the arrest rate for each 'violation' and 'rating'
ri_weather.groupby(["violation", 'rating']).is_arrested.mean()
The arrest rate increases as the weather gets worse, and that trend persists across many of the violation types. This doesn't prove a causal link, but it's quite an interesting result!
Selecting from a multi-indexed Series
The output of a single .groupby()
operation on multiple columns is a Series with a MultiIndex. Working with this type of object is similar to working with a DataFrame:
- The outer index level is like the DataFrame rows.
- The inner index level is like the DataFrame columns.
# Save the output of the groupby operation from the last exercise
arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean()
# Print the arrest rate for moving violations in bad weather
display(arrest_rate.loc["Moving violation", "bad"])
# Print the arrest rates for speeding violations in all three weather conditions
arrest_rate.loc["Speeding"]
Reshaping the arrest rate data
We'll start by reshaping the arrest_rate Series into a DataFrame. This is a useful step when working with any multi-indexed Series, since it enables you to access the full range of DataFrame methods.
Then, we'll create the exact same DataFrame using a pivot table. This is a great example of how pandas often gives you more than one way to reach the same result!
# Unstack the 'arrest_rate' Series into a DataFrame
display(arrest_rate.unstack())
# Create the same DataFrame using a pivot table
ri_weather.pivot_table(index='violation', columns='rating', values='is_arrested')