Imagine that you’ve been tasked with strategically allocating scarce police resources across the City of Los Angeles. Understanding the value of data, you decide to analyze the historical crime data that the City of L.A. has kindly made available via the LA Open Data Portal. With this data in hand, you feel much more confident that you can predict when and where crimes might occur and deploy police forces more efficiently. Let’s see the data!
Introducing the Data
We have crime data from Los Angeles covering 2012 through the first quarter of 2016, inclusive. The data was downloaded as five separate CSV files from data.lacity.org in the summer of 2016. Some minimal processing was done to combine the raw data and generate a new dataset with a few additional features for convenience. The data is available as a zipped file.
Altogether, the dataset contains 1,018,653 crimes and includes the following features (some derived):
|feature||type||description|
|year_id||character||original dataset id|
|date_rptd||date||date crime reported|
|date_occ||date||date crime occurred|
|rd||character||nearby road identifier|
|status_desc||character||status outcome of crime|
|location||character||nearby address location|
|cross_st||character||nearby cross street|
|year||numeric||year crime occurred|
|month||numeric||month crime occurred|
|day_of_month||numeric||day of month crime occurred|
|hour_of_day||numeric||hour of day crime occurred|
|day_of_week||character||day of week crime occurred|
|simple_crime_bin||character||subjective binning of crimes|
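As a rough illustration of how the derived calendar features could be produced, here is a minimal sketch. The function name `derive_features` and the raw formats (an "MM/DD/YYYY" date and a 24-hour "HHMM" time string) are assumptions made for this example, not a description of the processing that was actually done.

```python
from datetime import datetime

def derive_features(date_occ: str, time_occ: str) -> dict:
    """Derive calendar features from a raw date and a 24-hour HHMM time string."""
    # Assumed raw formats: "MM/DD/YYYY" for the date and "HHMM" for the time.
    # zfill pads short times such as "30" to "0030" (00:30).
    moment = datetime.strptime(f"{date_occ} {time_occ.zfill(4)}", "%m/%d/%Y %H%M")
    return {
        "year": moment.year,
        "month": moment.month,
        "day_of_month": moment.day,
        "hour_of_day": moment.hour,
        "day_of_week": moment.strftime("%A"),
    }

# Made-up example record, not taken from the dataset.
features = derive_features("03/15/2014", "1200")
```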
Total Crime Aggregations
A natural first question to ask of the data is “at what hours of the day do most crimes occur?” Let’s keep things fairly simple to start with. Aggregating all the data into an average crime rate for each hour of the day reveals an interesting peak at 12:00 PM in every year of the dataset.
Well, here’s some unexpected insight! If the hour between noon and 1 PM really is the busiest time for criminals, then we should surely act on this information: inform the police of this interesting finding and transfer crime-fighting resources from evening hours to the lunch time hours. But wait. Is this a real insight gleaned from the data, or is it simply an artifact of how the data was collected? Let’s dig deeper and check which types of crimes contributed to this peak.
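The hourly aggregation can be sketched roughly as follows. The sample rows and the helper names `hourly_counts` and `peak_hour_by_year` are hypothetical; the sketch uses raw counts, which could be divided by the number of days in each year to get an average rate.

```python
from collections import Counter

# Hypothetical sample rows of (year, hour_of_day); not real data.
records = [
    (2012, 12), (2012, 12), (2012, 18), (2012, 3),
    (2013, 12), (2013, 12), (2013, 12), (2013, 20),
]

def hourly_counts(rows):
    """Count crimes per (year, hour_of_day) pair."""
    return Counter((year, hour) for year, hour in rows)

def peak_hour_by_year(rows):
    """Return the hour with the highest crime count for each year."""
    counts = hourly_counts(rows)
    peaks = {}
    for (year, hour), n in counts.items():
        # Keep the hour only if it beats the current best for that year.
        if year not in peaks or n > counts[(year, peaks[year])]:
            peaks[year] = hour
    return peaks

peaks = peak_hour_by_year(records)
```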
Which Crimes Cause These Lunch Time Crime Peaks?
Let’s take a closer look and split by different types of crimes. To start, we’ll classify all 104 unique crime types into 13 simple_crime_bins (this was a highly subjective process!).
Clearly, this noon peak isn’t shared among all crimes. In fact, only the fraud, other, sexual, and theft crime types show the lunch time peak. Could these peaks arise because victims of these crimes are less likely to know when the crime occurred?
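One way to find which bins peak at noon is to count crimes per bin and hour, then keep the bins whose busiest hour is 12. This is a small sketch on made-up rows; the function name `bins_peaking_at` and the ASSAULT bin are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows of (simple_crime_bin, hour_of_day); not real data.
records = [
    ("THEFT", 12), ("THEFT", 12), ("THEFT", 12), ("THEFT", 15),
    ("FRAUD", 12), ("FRAUD", 12), ("FRAUD", 9),
    ("ASSAULT", 22), ("ASSAULT", 22), ("ASSAULT", 23), ("ASSAULT", 12),
]

def bins_peaking_at(rows, hour=12):
    """Return the crime bins whose most frequent hour equals `hour`."""
    per_bin = defaultdict(Counter)
    for crime_bin, hour_of_day in rows:
        per_bin[crime_bin][hour_of_day] += 1
    return sorted(
        crime_bin
        for crime_bin, counts in per_bin.items()
        if counts.most_common(1)[0][0] == hour
    )

noon_bins = bins_peaking_at(records)
```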
Investigating these Crimes Further
Filtering to the FRAUD, OTHER, SEXUAL, and THEFT crimes and holding the y scales constant shows THEFT to be the main culprit behind the noon peak in the number of crimes.
It’s interesting and suspicious that the sudden, narrow peak sits far above the gradual rise in the THEFT line. This suggests that the crime peak might not be real. A likely explanation is that 12:00 PM is the default time of day recorded when the time of a crime’s occurrence is unknown, especially given that thefts usually occur when the victim isn’t present. But to make sure, let’s check geographically.
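The "sudden, narrow peak" idea can be made a little more concrete by comparing the noon count with its neighboring hours. The sketch below is one simple heuristic, not the method used for the plots; `spike_ratio` is a hypothetical helper and the counts are invented.

```python
def spike_ratio(counts_by_hour, hour=12):
    """Ratio of the count at `hour` to the mean count of the two adjacent hours."""
    before = counts_by_hour.get((hour - 1) % 24, 0)
    after = counts_by_hour.get((hour + 1) % 24, 0)
    baseline = (before + after) / 2
    at_hour = counts_by_hour.get(hour, 0)
    if baseline == 0:
        # No neighboring data: any count at `hour` is an unbounded spike.
        return float("inf") if at_hour else 0.0
    return at_hour / baseline

# Invented theft counts per hour, for illustration only.
theft_counts = {10: 900, 11: 1000, 12: 3000, 13: 1100, 14: 1150}
ratio = spike_ratio(theft_counts)  # 3000 / ((1000 + 1100) / 2)
```

A ratio near 1 means the hour fits the surrounding trend. A value well above 1 at exactly 12:00 is the kind of isolated jump that hints at a default timestamp rather than real activity.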
Is Theft Centered at only one Area?
Since this peak appears across much of the city rather than in a single area, it seems more likely that the peak is due to how the data is collected and recorded, not a true peak in criminal activity during lunch time.
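A geographic check along these lines might compute, for each area, the share of thefts recorded at noon. This sketch assumes some area key is available for each crime (for example, one derived from rd or location); the area labels, rows, and the name `noon_share_by_area` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical theft rows of (area, hour_of_day); not real data.
theft_rows = [
    ("area_a", 12), ("area_a", 12), ("area_a", 9), ("area_a", 17),
    ("area_b", 12), ("area_b", 12), ("area_b", 20), ("area_b", 8),
    ("area_c", 12), ("area_c", 15),
]

def noon_share_by_area(rows, hour=12):
    """Fraction of each area's crimes that were recorded at `hour`."""
    totals = defaultdict(int)
    at_hour = defaultdict(int)
    for area, hour_of_day in rows:
        totals[area] += 1
        if hour_of_day == hour:
            at_hour[area] += 1
    return {area: at_hour[area] / totals[area] for area in totals}

shares = noon_share_by_area(theft_rows)
```

If the noon share is similarly elevated in every area, as in this invented sample, the peak is unlikely to come from one local hotspot.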
Why is This Important?
In my experience, some of the biggest challenges in working with real datasets involve data that isn’t there and data that can’t necessarily be trusted as is. In this case, the LA crimes dataset presents the second challenge, and this shouldn’t be taken lightly. Relying on crime data alone to decide when and how to staff thin police resources probably wouldn’t be wise. This also highlights the importance of intuition and domain expertise in any data analysis.