Horror stories about the home contractor are a prolific cliche. I want to cut through the melodrama and get some data driven understanding about the contractors I might hire. I want someone experienced, who follows all the laws and rules, regardless of how byzantine they might be. I’d also like some clue into how many jobs these companies typically take on, so that I don’t end up working with someone that takes on more business than they can handle.
Of course, a large company can handle more parallel work than a small company. I don’t expect an honest answer if I ask a prospective contractor about their workload, but I do expect an honest headcount. If I can estimate the amount of work they usually take on, I can get a nice metric – projects per employee.
While I presume not every project needs a permit, the permit ata found in the Building and Safety Permit Information dataset can be a useful proxy. It contains information about issued permits for construction, remodeling, and repair of buildings.
I’m only going to look at records that are finalized permits and also drop anything before 2015.
%matplotlib inline import matplotlib.pyplot as plt import requests import json import numpy as np import pandas as pd
url = 'https://data.lacity.org/api/views/yv23-pmwf/rows.json?accessType=DOWNLOAD' r = requests.get(url) print (r.status_code)
jdata = json.loads(r.content) columns_metadata = jdata['meta']['view']['columns'] colnames = map(lambda x: x['fieldName'], columns_metadata) df = pd.DataFrame(data=jdata['data'], columns=colnames) df['status_date'] = pd.to_datetime(df['status_date']) filter1 = df['latest_status'] == 'Permit Finaled' filter2 = df['status_date'] > '2015-01-01' df2 = df[filter1 & filter2]
I also want to make sure I’m only looking at residential permits. Based on the figure below, I decided to only look at records for the largest class (1 or 2 Family Dwelling) since that’s the category we’d fit into. There are plenty of data points, so I don’t need to worry about using apartment permits which might be similar but are also different.
ptypes = df.groupby(['permit_sub_type'])[':id'].count() ptypes.sort_values(ascending=1, inplace=True) x = np.arange(len(ptypes)) plt.figure(figsize=(15,5)) plt.barh(x, ptypes) plt.yticks(x+0.4, ptypes.index) plt.xlabel('Number of permits in this 2 year time period', fontsize=12) plt.show()
filter3 = df['permit_sub_type'] == '1 or 2 Family Dwelling' df2 = df2[filter3]
contractors = df2.groupby(['contractors_business_name'])[':id'].count() contractors.sort_values(ascending=0, inplace=True) permits = contractors.tolist() bins = np.arange(10) * 1 + 1 bins = np.append(bins, 99999999) h = np.histogram(permits, bins=bins) x = np.arange(len(h)) plt.figure(figsize=(15,5)) plt.bar(x, h) labels = map(lambda x: str(x), bins[0:len(bins)-1]) labels[len(labels)-1] = labels[len(labels)-1] + '+' plt.xticks(x+0.4, labels, fontsize=22) plt.xlabel('Histogram of number of permits per contractor', fontsize=22) plt.xlim(-0.5, 10.5) plt.show()
In the above histogram, I plotted the distribution over contractors based on how many permits they had issued. The distribution is extremely long tailed, and the modal value is a single permit. I explored a bit more looking for any separation that might exist between companies/professionals and people that do occasional odd jobs. This distribution seems pretty smooth.
What about the most active contractors?
top = contractors[0:10].copy() top.sort(ascending=1) x = np.arange(len(top)) plt.barh(x, top) plt.yticks(x+0.4, top.index) plt.xlabel('Number of permits in this 2 year time period', fontsize=16) plt.show()
The first entry is obviously a placeholder and not a single entity. Our close second place goes to Solarcity for most permits issued in this time period. I suspect if I dug further, that would be in part because of government programs that provide attractive incentives to homeowners wanting to install solar panels.
For the sake of comparison, I thought I’d wind up this short exploration with a look into the types of permits people are requesting, since the city also provides a permit type.
permit_types = df.groupby(['permit_type'])[':id'].count() permit_types.sort_values(ascending=1, inplace=True) permit_types x = np.arange(len(permit_types)) plt.figure(figsize=(6,7)) plt.barh(x, permit_types) plt.yticks(x+0.4, permit_types.index) plt.show()
So what have I learned?
The contractor businesses are extremely long tailed. The high frequency permit companies (like Solarcity) seem to offer a specific service, which they are able to scale horizontally or franchise out.
The typical small business does only a few jobs per year for which permits are issued. We want to hire a company that is established, experienced, and files all the proper paperwork. If we define that as minimum 20 permits in the last 2 year, then there are 487 companies we have to choose from.
I’m glad this dataset is available for several reasons. First, it allows me to get some background on whether or not a company is established. Second, it allows me to ask about all the work the contractor has done for which there were permits filed, not just the cherry picked testimonials they’d like to spoon feed me. With so many horror stories about contractors, it’s great to be able to do a bit of validation with open data.