A Cautionary Tale About Time of Day Analysis Using Los Angeles Crime Data

The Story

Imagine that you’ve been tasked with strategically allocating scarce police resources across the City of Los Angeles. Understanding the value of data, you decide to do some historical analysis of the data that the City of L.A. has kindly made available via the LA Open Data Portal. With this data in hand, you feel much more confident that you can predict when and where crimes might occur and use crime-fighting police forces more efficiently. Let’s see the data!

Introducing the Data

We have a collection of crime data from Los Angeles covering 2012 through the first quarter of 2016. The data was downloaded as five separate CSV files from data.lacity.org in the summer of 2016. Some minimal processing was done to combine the raw data and generate a new dataset with a few additional features for convenience. The data is available as a zipped file.

Altogether, the dataset contains 1,018,653 crimes and includes the following features (some derived):

variable name       type        description
year_id             character   original dataset id
date_rptd           date        date crime reported
dr_no               character   ?
date_occ            date        date crime occurred
area_name           character   geographical location
rd                  character   nearby road identifier
crm_cd_desc         character   crime type
status_desc         character   status outcome of crime
location            character   nearby address location
cross_st            character   nearby cross street
lat                 numeric     latitude
long                numeric     longitude
year                numeric     year crime occurred
month               numeric     month crime occurred
day_of_month        numeric     day of month crime occurred
hour_of_day         numeric     hour of day crime occurred
day_of_week         character   day of week crime occurred
weekday             character   weekday/weekend classification
simple_crime_bin    character   subjective binning of crimes

Total Crime Aggregations

A natural first question to ask of the data is “when do the majority of crimes occur during the hours of the day?”. Let’s keep things fairly simple to start with. By aggregating all the data into an average crime rate for each hour of the day, we see an interesting peak at 12:00 PM for each year in the dataset.
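One way to sketch this per-hour aggregation is shown below. It is a minimal illustration rather than the code behind the original plot: the toy records are made up, and only the `year` and `hour_of_day` column names come from the feature table above.

```python
from collections import Counter

# Toy records standing in for dataset rows: (year, hour_of_day).
# The values are made up for illustration only.
records = [
    (2014, 12), (2014, 12), (2014, 18), (2014, 3),
    (2015, 12), (2015, 12), (2015, 12), (2015, 20),
]

def hourly_counts(rows):
    """Count crimes per (year, hour_of_day) pair."""
    return Counter((year, hour) for year, hour in rows)

def peak_hour_by_year(rows):
    """Return {year: hour with the most recorded crimes in that year}."""
    counts = hourly_counts(rows)
    peaks = {}
    for (year, hour), n in counts.items():
        best = peaks.get(year)
        # Keep the first hour seen on ties; replace only on a strictly larger count.
        if best is None or n > counts[(year, best)]:
            peaks[year] = hour
    return peaks

print(peak_hour_by_year(records))
```

With real data, the same grouping by year and hour would feed a line plot with one line per year.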


Well, here’s some unexpected insight! If the hour between noon and 1 PM really is the busiest time for criminals, then we should surely act on this information! Inform the police of this interesting finding; transfer crime-fighting resources from evening hours to the lunch time hours. But wait. Is this a real insight gleaned from the data, or is it simply an artifact of how the data was collected? Let’s dig deeper and check which types of crimes contributed to this peak.

Which Crimes Cause These Lunch Time Crime Peaks?

Let’s take a closer look and split by different types of crimes. To start, we’ll classify all 104 unique crime types into 13 simple_crime_bins (this was a highly subjective process!).


Obviously, this noon peak isn’t shared amongst all crimes. In fact, it appears that only the fraud, other, sexual, and theft crime types have this lunch time peak. Could the peaks for these crime types be explained by victims being less likely to know when the crimes occurred?

Investigating these Crimes Further

Filtering to the FRAUD, OTHER, SEXUAL, and THEFT crimes and holding the y scales constant shows that THEFT is the main culprit for the noon peak in the number of crimes.


It’s interesting and suspicious that the sudden, narrow peak sits far above the gradual rise in the THEFT line. This suggests that the crime peak might not be real. A likely explanation is that 12:00 PM is the default time of day recorded when the time of a crime is unknown, especially given that thefts usually occur when the victim isn’t present. But to make sure, let’s check geographically.
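Before moving on, the “sudden, narrow peak” observation can be made a little more concrete. The sketch below compares the count at one hour with the mean of its two neighboring hours; a ratio far above 1 flags a spike that the surrounding trend does not explain. The function name and the toy counts are assumptions for illustration, not values from the dataset.

```python
def spike_ratio(counts_by_hour, hour):
    """Ratio of the count at `hour` to the mean of its two neighboring hours."""
    before = counts_by_hour[(hour - 1) % 24]
    after = counts_by_hour[(hour + 1) % 24]
    neighbor_mean = (before + after) / 2
    return counts_by_hour[hour] / neighbor_mean

# Made-up hourly theft counts with an artificial jump at noon.
theft = {h: 100 for h in range(24)}
theft[11] = 140
theft[12] = 400
theft[13] = 150

print(round(spike_ratio(theft, 12), 2))  # noon vs. 11 AM and 1 PM
print(spike_ratio(theft, 5))             # a flat stretch gives a ratio of 1.0
```

A gradual daytime rise produces ratios close to 1 at every hour, so an isolated large ratio at exactly 12:00 is consistent with a recording default rather than real activity.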

Is Theft Centered at only one Area?


Since this peak is fairly prevalent geographically, it seems more likely the peak is in fact due to how the data is collected and recorded and not from a true peak in criminal activity during lunch time.

Why is This Important?

In my experience, some of the biggest challenges in working with real datasets tend to deal with data that isn’t there, and data that can’t necessarily be trusted as is. In this case, the LA crimes dataset presents the second challenge, and this shouldn’t be taken lightly. Only using crime data to better understand when and how we should staff thin police resources probably wouldn’t be that smart. This also highlights the importance of intuition and domain expertise in any data analysis.


Testing for Racial Profiling in the LAPD using Open 2015 Police Stop Data

Racial tensions have been mounting across the nation. Police shootings involving racial minorities have captured the nation’s attention. The Pew Research Center reports that for the upcoming 2016 presidential election “as many as (63%) say the issue of how racial and ethnic minorities are treated will be very important to their vote.” The Black Lives Matter group has become a household name.

Police Racial Headlines

Obviously, these issues are generating a lot of media hype and police criticism. We will do our best to measure and test for racial profiling in the vehicle stops made in the city of Los Angeles in the year 2015. Fortunately, the LA Open Data Initiative has made this possible with great APIs and data downloads, although the LA 2015 Stop Data lacks substantial documentation. You can explore more of the open data sets at data.lacity.org.

To assess the LAPD for racial profiling, we’ll examine each of the 25 major police groupings or divisions that fall under the four major bureaus for the city of LA. Twenty-one of these divisions are represented by actual police stations (and their corresponding geographical areas), while the other four are traffic divisions, which mostly handle duties like investigating traffic accidents and issuing citations. The basic breakdown can be seen below.

Central Bureau     South Bureau        Valley Bureau           West Bureau
Central Area       77th Street Area    Devonshire Area         Hollywood Area
Hollenbeck Area    Harbor Area         Foothill Area           Olympic Area
Newton Area        Southeast Area      Mission Area            Pacific Area
Northeast Area     Southwest Area      North Hollywood Area    West Los Angeles Area
Rampart Area                           Van Nuys Area           Wilshire Area
                                       West Valley Area
                                       Topanga Area
Central Traffic    South Traffic       Valley Traffic          West Traffic

We can get a general sense of the size of the divisions thanks to the LA Times. Here’s a snapshot.

LAPD Divisions

Introducing the Data

Overall, the data shows that a combined 620,315 vehicle and pedestrian stops were made between Jan 1, 2015 and Dec 31, 2015. We can see the overall breakdown as follows.

overall stops

This picture doesn’t suggest much: we know that Los Angeles is inhabited predominantly by Hispanics, Blacks, and Whites. In fact, a small table from the 2015 census data shows:

Population Type (single race only)         Population % of LA County
Hispanic or Latino                         48.4%
White alone, not Hispanic or Latino        26.6%
Asian alone                                15.0%
Black or African American alone            9.1%
American Indian and Alaska Native alone    1.5%

Breaking down these race proportions by division gives us:

division breakdown

Testing Methods

What we need is some type of benchmark to test against. Let’s illustrate this with a made-up example city. Say this city’s population is 75% Hispanic. Then, assuming this city has a non-discriminating police force, we shouldn’t be surprised when somewhere near 75% of the stops made involve a Hispanic person. This reasoning suggests benchmarking stop data against population data, often taken from the United States Census. This is one way to assess police discrimination by race, but it’s not perfect.
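The benchmark idea can be sketched as a simple comparison between a group’s observed share of stops and its population share. The code below is only an illustration under assumptions: the function name and the stop counts are hypothetical, and the normal approximation to the binomial is one convenient way to express “how far from the benchmark”, not a method taken from this analysis.

```python
import math

def benchmark_z_score(stops_of_group, total_stops, population_share):
    """z-score of the observed stop share against a population benchmark.

    Uses the normal approximation to the binomial distribution.
    """
    observed = stops_of_group / total_stops
    se = math.sqrt(population_share * (1 - population_share) / total_stops)
    return (observed - population_share) / se

# Example city from the text: 75% of the population is Hispanic.
print(benchmark_z_score(750, 1000, 0.75))            # stops match the benchmark
print(round(benchmark_z_score(800, 1000, 0.75), 2))  # more stops than the benchmark predicts
```

A z-score near zero means the stop share is in line with the population share; large positive or negative values mean it is not, though the benchmark itself may be the wrong yardstick.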

The trouble is that stop rates depend on more than population share. We know the majority of police stops involve vehicles (in fact, 72% in our real LA 2015 data set), so even with a non-discriminating police force, Hispanics in our example city might account for less than 75% of stops for reasons such as:

  • Hispanics don’t drive as far
  • Hispanics drive less often
  • Hispanics tend to exhibit different driving patterns

Another method we might try is issuing traffic surveys to better understand how different races typically spend time on the roads. However, this would seem extremely costly and we would need to think carefully to avoid selection bias.

To avoid the pitfalls of the above methods, let us introduce the Veil of Darkness Hypothesis. Instead of using population or survey data as controls, we compare the number of stops made in daylight with the number made at night. If a significantly greater proportion of a race’s stops are made during daylight than at night, we would hypothesize that the police are unfairly targeting that racial group: they stop its members more often when they can see the race of the driver than at night, when visually identifying the race of a driver is much more challenging.

More on the Veil of Darkness

The Veil of Darkness Hypothesis was first explored by Jeffrey Grogger and Greg Ridgeway in 2006, when they used Oakland’s traffic stop data and found little evidence of police racial profiling against black drivers. You can check out the paper at Testing for Racial Profiling in Traffic Stops From Behind A Veil of Darkness. Since then, others have used the Veil of Darkness methodology in other publications and reports.

This method requires certain assumptions. Between daylight and night, we require that:

  1. Traffic patterns
  2. Driving behavior
  3. Exposure to law enforcement

all remain constant for each race. Since this is unlikely to be true across the whole day, we approximate these assumptions by using only police stops made in the inter-twilight time.

Inter-Twilight Time

Generally speaking, it is unlikely that people of different races spend the same amount of time on the road. To follow the assumptions of the Veil-of-Darkness methodology, we filter out most of our stops and test only on stops that fall between 4:43 PM and 8:10 PM, the earliest and latest sundown clock times of 2015, respectively.

Within this 207-minute interval, we classify a vehicle stop as one made before or after sundown, that is, as driver visible or driver not visible.
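The two classifications can be sketched as follows. This is a minimal illustration, assuming each stop carries a timestamp and that the sunset time for its date is looked up elsewhere; the function names are hypothetical and the sunset value in the example is a placeholder, not an actual 2015 sunset time.

```python
from datetime import datetime, time

# Window bounds taken from the text: earliest and latest 2015 sundown clock times.
EARLIEST_SUNSET = time(16, 43)  # 4:43 PM
LATEST_SUNSET = time(20, 10)    # 8:10 PM

def in_inter_twilight(stop_dt):
    """True if the stop's clock time falls inside the inter-twilight window."""
    return EARLIEST_SUNSET <= stop_dt.time() <= LATEST_SUNSET

def is_daylight(stop_dt, sunset_dt):
    """True if the stop happened before that day's sunset (driver visible)."""
    return stop_dt < sunset_dt

def window_minutes():
    """Length of the inter-twilight window in minutes."""
    start = EARLIEST_SUNSET.hour * 60 + EARLIEST_SUNSET.minute
    end = LATEST_SUNSET.hour * 60 + LATEST_SUNSET.minute
    return end - start

stop = datetime(2015, 6, 15, 18, 30)
sunset = datetime(2015, 6, 15, 20, 0)  # placeholder sunset time for illustration only

print(in_inter_twilight(stop), is_daylight(stop, sunset), window_minutes())
```

The same clock time can land on either side of sundown depending on the season, which is what lets the comparison hold time of day roughly fixed while visibility changes.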

Our Analysis Process

To carry out the analysis, we first prepare the data. For each stop, we

  • classify it as in/out of the inter-twilight period
  • classify it as daylight/dark (proxy for visible/not visible)

Then, we filter our stop data to only:

  1. the aforementioned 25 divisions
  2. White, Black, and Hispanic stops
  3. inter-twilight period stops
  4. vehicle type stops
  5. weekday stops

Overall, this leaves us with only 43,652 stops out of the original 620,315 (~ 7%).


For each of the three races tested (Hispanic, Black, and White; we lacked a sufficient number of stops for Asians, American Indians, and Others to test accurately) and each of the 25 police divisions, we calculate the difference between the proportion of stops made before sundown and the proportion made after sundown.
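A difference in proportions like this is commonly tested with a two-proportion z-test. The sketch below is one plain version of that test, without a continuity correction; it is an assumption about a reasonable approach, not necessarily the exact test used to produce the plots here, and the counts in the example are made up.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions.

    x1 / n1: e.g. stops of one race among all daylight stops in a division.
    x2 / n2: e.g. stops of that race among all dark stops in the same division.
    Returns (difference in proportions, z statistic, two-sided p-value).
    """
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, z, p_value

# Made-up counts: 60 of 100 daylight stops vs. 40 of 100 dark stops.
diff, z, p = two_proportion_z_test(60, 100, 40, 100)
print(round(diff, 2), round(z, 3), round(p, 4))
```

Running this once per race and division yields one difference and one p-value per combination.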

proportional differences

The divisions in each subplot are sorted by largest difference between visible and non-visible stops. This means that divisions at the top are the ones that stop more of that race during sunlight hours than at night. For example, in this data Hispanics are stopped much more frequently during visible hours in the OLYMPIC, PACIFIC, and HOLLYWOOD divisions.

To better understand the significance of our tests, we can analyze the distributions of our p values:

three race prop test pvalues

It is important to understand that we just tested 25 * 3 = 75 different hypotheses. As we test more and more hypotheses on the same data set, it becomes more likely that we’ll see a false positive result (i.e. we should expect a few divisions to show small p values and large differences in night vs daylight proportions purely by chance).

One way to correct for the multiple comparisons problem involves using the Bonferroni method. Put simply, instead of using the 5% alpha significance level we use a much more conservative 0.05 / 25 = 0.002 = 0.2% significance level.
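The correction itself is a single division, sketched below with made-up p-values and hypothetical helper names. Note that the threshold depends on how many tests are counted in the family: the text divides by 25, while treating all 75 tests as one family would give a stricter cutoff of roughly 0.00067.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests

def significant_after_correction(p_values, alpha, n_tests):
    """Keep only the results whose p-value is below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return {name: p for name, p in p_values.items() if p < threshold}

# Made-up p-values keyed by hypothetical (division, race) labels.
toy_p_values = {
    ("DIVISION A", "White"): 0.0004,
    ("DIVISION B", "Hispanic"): 0.03,
    ("DIVISION C", "Black"): 0.0015,
}

print(bonferroni_threshold(0.05, 25))
print(significant_after_correction(toy_p_values, 0.05, 25))
print(significant_after_correction(toy_p_values, 0.05, 75))
```

In the toy data, the result with p = 0.03 would pass the uncorrected 5% level but fails both corrected thresholds, and the result with p = 0.0015 survives only when dividing by 25.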

Applying this correction, we see only a small handful of divisions meeting this requirement. Below is a plot of the divisions showing the test results’ log base 10 p-values (this means the smallest p-values create the largest bars at the bottom).

bonferroni significance plot

Interestingly, all traffic divisions appear, and five of the six significant division-test results involve White drivers. If anything, the data suggests a “reverse” racial profiling against white drivers.

Conclusion and Further Work

Before even accounting for the multiple comparison testing issue, we can see that the data is unlikely to suggest that the LA Police Divisions as a whole are racially biased against Hispanics or Blacks. A few divisions on average could potentially be targeting minorities in their stops, but that is difficult to tease apart from the likely statistical noise associated with so many tests.

One thing we’d like to further investigate is how divisions with higher amounts of crime and/or arrests might target minority races differently. Do officers in higher crime neighborhoods tend to find reasons to pull over minority drivers in vehicles during visible periods?

This Veil-of-Darkness method isn’t bulletproof, of course. We should be sure to understand the limitations of the data and our analysis. To begin with, we only have one year of police stop data. Furthermore, we would prefer to have a better understanding of which stops lead to further action. In the data set, we can hypothesize that the post_stop_activity column refers to this, but I was unable to find any documentation defining post_stop_activity. Our results for the LA police stop data agree with several other reports made using the Veil-of-Darkness methodology, which likewise found no obvious evidence of racial bias.

Contributors and Special Thanks

This was done by the participants of the NewMet DataScience Bootcamp from August to December 2016. This analysis was primarily performed by Brian Becker and edited by Annie Flippo. We would like to give a shout out to our dedicated mentors and contributors for their support:

  • Weixiang Chen (founder of NewMet Data)
  • Kyle Polich
  • Annie Flippo
  • Ethan He
  • Bob Newstadt
  • Lorie Obal

A special thanks to the ITA (Information Technology Agency) team and the Chief Data Officer team of the City of Los Angeles, especially Krishna Bhogaonker, for their dedication and support.

Building and Safety Permit Exploration

My wife and I purchased our first home in the Leimert Park neighborhood of Los Angeles last year. While our house was move-in ready, we are planning to invest heavily in maintenance and improvements in the next few years. I’m pretty handy with home repair, but I’m finding time is my real constraint. I’m going to need to hire help.

Horror stories about home contractors are a prolific cliche. I want to cut through the melodrama and get some data-driven understanding of the contractors I might hire. I want someone experienced, who follows all the laws and rules, regardless of how byzantine they might be. I’d also like some clue as to how many jobs these companies typically take on, so that I don’t end up working with someone who takes on more business than they can handle.

Of course, a large company can handle more parallel work than a small company. I don’t expect an honest answer if I ask a prospective contractor about their workload, but I do expect an honest headcount. If I can estimate the amount of work they usually take on, I can get a nice metric – projects per employee.

While I presume not every project needs a permit, the permit data found in the Building and Safety Permit Information dataset can be a useful proxy. It contains information about issued permits for construction, remodeling, and repair of buildings.

I’m only going to look at records that are finalized permits and also drop anything before 2015.

%matplotlib inline
import matplotlib.pyplot as plt
import requests
import numpy as np
import pandas as pd

# Download the full permit dataset as JSON from the LA Open Data Portal
url = 'https://data.lacity.org/api/views/yv23-pmwf/rows.json?accessType=DOWNLOAD'
r = requests.get(url)
print(r.status_code)
jdata = r.json()

# Column names live in the view metadata; build a real list for pandas
columns_metadata = jdata['meta']['view']['columns']
colnames = [col['fieldName'] for col in columns_metadata]
df = pd.DataFrame(data=jdata['data'], columns=colnames)

# Keep only finalized permits with a status date after January 1, 2015
df['status_date'] = pd.to_datetime(df['status_date'])
filter1 = df['latest_status'] == 'Permit Finaled'
filter2 = df['status_date'] > '2015-01-01'
df2 = df[filter1 & filter2]

I also want to make sure I’m only looking at residential permits. Based on the figure below, I decided to only look at records for the largest class (1 or 2 Family Dwelling) since that’s the category we’d fit into. There are plenty of data points, so I don’t need to fall back on apartment permits, which might be similar but are not quite the same.


# Count permits per sub-type and plot them, smallest to largest
ptypes = df.groupby(['permit_sub_type'])[':id'].count()
ptypes = ptypes.sort_values(ascending=True)
x = np.arange(len(ptypes))
plt.barh(x, ptypes)
# Bars are centered on x in matplotlib 2.0 and later, so ticks go at x
plt.yticks(x, ptypes.index)
plt.xlabel('Number of permits in this 2 year time period', fontsize=12)

# Restrict the finalized permits to 1 or 2 family dwellings
# (build the mask from df2 so its index matches the frame being filtered)
filter3 = df2['permit_sub_type'] == '1 or 2 Family Dwelling'
df2 = df2[filter3]

# Count permits per contractor, most active first
contractors = df2.groupby(['contractors_business_name'])[':id'].count()
contractors = contractors.sort_values(ascending=False)
permits = contractors.tolist()

# Bins for 1, 2, ..., 9 permits plus a catch-all bin for 10 or more
bins = np.append(np.arange(1, 11), 99999999)
h = np.histogram(permits, bins=bins)
x = np.arange(len(h[0]))
plt.bar(x, h[0])
labels = [str(b) for b in bins[:-1]]
labels[-1] += '+'
plt.xticks(x, labels, fontsize=22)
plt.xlabel('Histogram of number of permits per contractor', fontsize=22)
plt.xlim(-0.5, 10.5)

In the above histogram, I plotted the distribution of contractors by how many permits they were issued. The distribution is extremely long tailed, and the modal value is a single permit. I explored a bit more, looking for any separation that might exist between companies/professionals and people who do occasional odd jobs. This distribution seems pretty smooth.

What about the most active contractors?

# Ten contractors with the most permits (contractors is sorted descending)
top = contractors.iloc[0:10].copy()
x = np.arange(len(top))
plt.barh(x, top)
plt.yticks(x, top.index)
plt.xlabel('Number of permits in this 2 year time period', fontsize=16)

The first entry is obviously a placeholder and not a single entity. A close second place goes to Solarcity for most permits issued in this time period. I suspect that, if I dug further, this would be in part because of government programs that provide attractive incentives to homeowners wanting to install solar panels.

For the sake of comparison, I thought I’d wind up this short exploration with a look into the types of permits people are requesting, since the city also provides a permit type.

# Count permits per permit type and plot them, smallest to largest
permit_types = df.groupby(['permit_type'])[':id'].count()
permit_types = permit_types.sort_values(ascending=True)
x = np.arange(len(permit_types))
plt.barh(x, permit_types)
plt.yticks(x, permit_types.index)

So what have I learned?

The contractor businesses are extremely long tailed. The high frequency permit companies (like Solarcity) seem to offer a specific service, which they are able to scale horizontally or franchise out.

The typical small business does only a few jobs per year for which permits are issued. We want to hire a company that is established, experienced, and files all the proper paperwork. If we define that as a minimum of 20 permits in the last 2 years, then there are 487 companies we have to choose from.
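A count like this can be obtained by thresholding the per-contractor permit counts. The sketch below shows the idea on a made-up frame with the same `contractors_business_name` column; the company names, the `toy` frame, and the `MIN_PERMITS` constant are illustrative assumptions. On the real data the same comparison would be applied to the `contractors` series built earlier.

```python
import pandas as pd

# Made-up permit records: one row per finalized permit.
toy = pd.DataFrame({
    'contractors_business_name': ['A CO'] * 25 + ['B CO'] * 3 + ['C CO'] * 20,
})

# Number of permits per contractor.
toy_counts = toy.groupby('contractors_business_name').size()

# Keep contractors at or above the chosen threshold of 20 permits.
MIN_PERMITS = 20
established = toy_counts[toy_counts >= MIN_PERMITS]

print(len(established))
print(established.sort_values(ascending=False))
```

Dividing each established contractor’s permit count by its headcount would then give the projects-per-employee metric described at the start.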

I’m glad this dataset is available for several reasons. First, it allows me to get some background on whether or not a company is established. Second, it allows me to ask about all the work the contractor has done for which there were permits filed, not just the cherry picked testimonials they’d like to spoon feed me. With so many horror stories about contractors, it’s great to be able to do a bit of validation with open data.