
IBM Data Science Capstone Project Notebook

1. Introduction

1.1 Business Problem

My cousin is in the late stages of developing the business plan for his restaurant, Restaraunte La Tierra De Tacos (which I will refer to as 'the company'). After careful planning and consideration, the company has chosen San Francisco (SF) as the location for the business. For the company, SF is a good choice given the city's relatively large population, its vibrant urban setting, and strong transportation infrastructure, among other factors. However, the company is having trouble choosing the best location for the restaurant within San Francisco. The company considers this a vital decision, as it may dictate its success (or failure).

The company needs assistance in filtering out the neighborhoods that don't meet its specific requirements, so that detailed location analyses can be conducted on only a select list of neighborhoods in SF. To define these neighborhoods, I chose the Census tracts assigned to SF County, since a tract most closely resembles what the company means by a 'neighborhood'. After careful discussion with the company, the following list of requirements that a neighborhood must meet was created:

  1. Contain a population of 5,000+ (to increase the opportunity for traffic and visibility)

  2. Be near at least one college and surrounded by at least three other types of schools (for partnerships with schools and other organizations, and to provide catering/delivery options)

  3. Have rent at or below the city's average of roughly $3,000

  4. Have a median income of $50,000 or more (a safeguard for tough financial times)

  5. Have minimal competition (Mexican restaurants) while still having similar venues (i.e., other restaurants) nearby

It is the aim and scope of this report to identify the neighborhoods in SF that meet the aforementioned requirements.

1.2 Packages Used

In [89]:
#Data manipulation and analyses
import pandas as pd
import numpy as np

#HTTP requests (used later for the Foursquare API calls)
import requests

#Map visualization
import folium
from folium.plugins import MarkerCluster
from IPython.display import display
from IPython.display import HTML

#Chart visualization
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

#Point-in-polygon analyses
from matplotlib import path

#Clustering
from sklearn.cluster import KMeans

#to share notebook
import jovian

1.3 Data Sources

In order to best answer the business question -- finding the neighborhoods in SF that meet the aforementioned requirements -- I have chosen to use data from the U.S. Census Bureau and Data SF. Through its surveys, the U.S. Census Bureau provides one of the best estimates of current U.S. population demographics at several geographic levels (city, tract, state, etc.). Similarly, SF's official open data portal, Data SF, collects data and makes it easily accessible and open to the public. Given the reliability of both agencies, the accessibility of their data, and the transparency of their methods, I decided to use their data for my analysis. I also chose to use the Foursquare API, a location data platform that provides access to detailed data about venues (e.g., nearby venues, trending venues) and users' interactions with them (e.g., ratings, tips). Their data will be useful for searching for nearby venues around the selected SF neighborhoods.

The following is a list of the datasets, with links that direct you to the data used in this report. You can also access the files from my GitHub account here.

  1. ACS 5-Year Estimates (2013-2017) -- Selected Characteristics of the Total and Native Populations in the United States (Census Table ID: S0601)

Filters used: Selected year 2017 only and all the census tracts within SF County.

  2. ACS 5-Year Estimates (2014-2018) -- Selected Housing Characteristics (Census Table ID: DP04)

Filters used: Selected year 2018 only and all the census tracts within SF County.

  3. Schools

  4. Colleges (2011)

  5. Analysis Neighborhoods - 2010 Census Tracts Assigned to Neighborhoods

2. Data Wrangling

2.1 SF Tracts

In [2]:
#loading the first dataset
pop =pd.read_csv(r"C:\Users\garci\Desktop\ACS_17_5YR_S0601_with_ann.csv",skiprows=1)
In [3]:
pop.shape
Out[3]:
(197, 163)

Let's start by cleaning and retrieving only the relevant data from the SF Main dataset.

In [4]:
reduce = pop.columns.to_list()
keep =[i for i in range(len(reduce)) if reduce[i].startswith('Total; Estimate')]
keep.append(2) #add geography column since it doesn't start with 'Total; Estimate'
pop = pop.iloc[:, keep] #grab all the rows but only the columns filtered in keep
      
#keep only 'Census Tract {number}' 
pop['Geo']=pop['Geography'].str.replace(', San Francisco County, California', '',regex=True)

print(pop.shape)
(197, 22)

Now we keep only the SF tracts that meet the population and income requirements (requirements 1 and 4 above): a total population above 5,000 and a median income of at least $50,000 per tract.

In [5]:
SF = pop[pop['Total; Estimate; Total population']> 5000] 
SF=SF.astype({'Total; Estimate; Median income (dollars)':'int64'})
SF=SF[SF['Total; Estimate; Median income (dollars)'] >= 50000]
#filter_criteria =SF['Geo'].to_list() 
SF.reset_index(drop=True,inplace=True)
print(SF.shape)
SF.head(2)
(18, 22)
Out[5]:

2.2 Schools and Colleges

For the next two datasets, we will convert each GeoJSON file into a pandas dataframe.

In [6]:
schools_geo = pd.read_json(r"C:\Users\garci\Desktop\San_Francisco\Schools.geojson")
colleges_geo = pd.read_json(r"C:\Users\garci\Desktop\San_Francisco\Colleges in San Francisco (2011).geojson")

We then create two functions. The first appends three new columns (school name, latitude, and longitude) to the existing dataframe. The second checks for null or NaN items.

In [7]:
def frames (data, propertiesName):
    for i in range(len(data)): #get the locations
        data.at[i, 'School Name'] = data['features'][i]['properties'][propertiesName] 
        data.at[i,'lat'] = data['features'][i]['geometry']['coordinates'][1] #grabs the second item in the tuple (latitude)
        data.at[i,'long'] = data['features'][i]['geometry']['coordinates'][0] #grabs the first item in the tuple (longitude)
    return data

def check (data, df):
    print('Number of null items: {}'.format(data.isnull().sum()))
    print('Number of NaN items: {}'.format(data.isna().sum()))
In [8]:
schools = frames(schools_geo,'campus_name')

print('Dataframe shape before preprocessing:',schools.shape)
check(schools, schools[['School Name','lat','long']])
print('--------------------------------------------------------------------------------------')
drops1 = list(schools[schools['School Name'].duplicated()].index.values)
print('Duplicates. Indexes to drop:', drops1)

schools.drop(columns=['type','features'],index =drops1,inplace=True)
print('Dataframe shape after preprocessing: ',schools.shape)
schools.head(5)
print('--------------------------------------------------------------------------------------')

colleges = frames(colleges_geo,'institution')

print('Dataframe shape before preprocessing:',colleges.shape)
check(colleges, colleges[['School Name','lat','long']])
print('--------------------------------------------------------------------------------------')
drops2 =list(colleges[colleges['School Name'].duplicated()].index.values)
print('Duplicates. Indexes to drop:', drops2)

colleges.drop(columns=['type','features'],index =drops2, inplace=True)

colleges.reset_index(drop=True,inplace=True)
print('Dataframe shape after preprocessing: ',colleges.shape)

print('--------------------------------------------------------------------------------------')
Dataframe shape before preprocessing: (445, 5)
Number of null items: type 0  features 0  School Name 0  lat 0  long 0  dtype: int64
Number of NaN items: type 0  features 0  School Name 0  lat 0  long 0  dtype: int64
--------------------------------------------------------------------------------------
Duplicates. Indexes to drop: [83, 190, 281, 291, 297]
Dataframe shape after preprocessing:  (440, 3)
--------------------------------------------------------------------------------------
Dataframe shape before preprocessing: (46, 5)
Number of null items: type 0  features 0  School Name 0  lat 0  long 0  dtype: int64
Number of NaN items: type 0  features 0  School Name 0  lat 0  long 0  dtype: int64
--------------------------------------------------------------------------------------
Duplicates. Indexes to drop: [2, 7, 10, 11, 12, 14, 16, 17, 18, 19, 23, 25, 27, 28, 30, 32, 33, 35, 36, 37, 40, 41, 43, 44, 45]
Dataframe shape after preprocessing:  (21, 3)
--------------------------------------------------------------------------------------

The dataframes have been cleaned, and no errors, duplicates, or null values remain.

2.3 Select Housing Characteristics

Next up is the housing dataset. This dataset contains many irrelevant columns; we will only be needing two of them.

In [9]:
land= pd.read_csv(r"C:\Users\garci\Desktop\San_Francisco\ACSDP5Y2018.DP04_data_with_overlays_2019-12-20T222527.csv", skiprows=1)

print(land.shape)
col =land.columns.to_list() #save the column (574 total) names to a list 
(197, 574)

First, we filter through all the columns and retrieve only the ones that contain the keywords of interest.

In [10]:
[i for i in col  if 'rent' in i.lower() or 'value' in i.lower()]
Out[10]:
['Estimate!!HOUSING OCCUPANCY!!Total housing units!!Rental vacancy rate',
 'Margin of Error!!HOUSING OCCUPANCY!!Total housing units!!Rental vacancy rate',
 'Percent Estimate!!HOUSING OCCUPANCY!!Total housing units!!Rental vacancy rate',
 'Percent Margin of Error!!HOUSING OCCUPANCY!!Total housing units!!Rental vacancy rate',
 'Estimate!!HOUSING TENURE!!Occupied housing units!!Renter-occupied',
 'Margin of Error!!HOUSING TENURE!!Occupied housing units!!Renter-occupied',
 'Percent Estimate!!HOUSING TENURE!!Occupied housing units!!Renter-occupied',
 'Percent Margin of Error!!HOUSING TENURE!!Occupied housing units!!Renter-occupied',
 'Estimate!!HOUSING TENURE!!Occupied housing units!!Average household size of renter-occupied unit',
 'Margin of Error!!HOUSING TENURE!!Occupied housing units!!Average household size of renter-occupied unit',
 'Percent Estimate!!HOUSING TENURE!!Occupied housing units!!Average household size of renter-occupied unit',
 'Percent Margin of Error!!HOUSING TENURE!!Occupied housing units!!Average household size of renter-occupied unit',
 'Estimate!!VALUE!!Owner-occupied units',
 'Margin of Error!!VALUE!!Owner-occupied units',
 'Percent Estimate!!VALUE!!Owner-occupied units',
 'Percent Margin of Error!!VALUE!!Owner-occupied units',
 'Estimate!!VALUE!!Owner-occupied units!!Less than $50,000',
 'Margin of Error!!VALUE!!Owner-occupied units!!Less than $50,000',
 'Percent Estimate!!VALUE!!Owner-occupied units!!Less than $50,000',
 'Percent Margin of Error!!VALUE!!Owner-occupied units!!Less than $50,000',
 'Estimate!!VALUE!!Owner-occupied units!!$50,000 to $99,999',
 'Margin of Error!!VALUE!!Owner-occupied units!!$50,000 to $99,999',
 'Percent Estimate!!VALUE!!Owner-occupied units!!$50,000 to $99,999',
 'Percent Margin of Error!!VALUE!!Owner-occupied units!!$50,000 to $99,999',
 'Estimate!!VALUE!!Owner-occupied units!!$100,000 to $149,999',
 'Margin of Error!!VALUE!!Owner-occupied units!!$100,000 to $149,999',
 'Percent Estimate!!VALUE!!Owner-occupied units!!$100,000 to $149,999',
 'Percent Margin of Error!!VALUE!!Owner-occupied units!!$100,000 to $149,999',
 'Estimate!!VALUE!!Owner-occupied units!!$150,000 to $199,999',
 'Margin of Error!!VALUE!!Owner-occupied units!!$150,000 to $199,999',
 'Percent Estimate!!VALUE!!Owner-occupied units!!$150,000 to $199,999',
 'Percent Margin of Error!!VALUE!!Owner-occupied units!!$150,000 to $199,999',
 'Estimate!!VALUE!!Owner-occupied units!!$200,000 to $299,999',
 'Margin of Error!!VALUE!!Owner-occupied units!!$200,000 to $299,999',
 'Percent Estimate!!VALUE!!Owner-occupied units!!$200,000 to $299,999',
 'Percent Margin of Error!!VALUE!!Owner-occupied units!!$200,000 to $299,999',
 'Estimate!!VALUE!!Owner-occupied units!!$300,000 to $499,999',
 'Margin of Error!!VALUE!!Owner-occupied units!!$300,000 to $499,999',
 'Percent Estimate!!VALUE!!Owner-occupied units!!$300,000 to $499,999',
 'Percent Margin of Error!!VALUE!!Owner-occupied units!!$300,000 to $499,999',
 'Estimate!!VALUE!!Owner-occupied units!!$500,000 to $999,999',
 'Margin of Error!!VALUE!!Owner-occupied units!!$500,000 to $999,999',
 'Percent Estimate!!VALUE!!Owner-occupied units!!$500,000 to $999,999',
 'Percent Margin of Error!!VALUE!!Owner-occupied units!!$500,000 to $999,999',
 'Estimate!!VALUE!!Owner-occupied units!!$1,000,000 or more',
 'Margin of Error!!VALUE!!Owner-occupied units!!$1,000,000 or more',
 'Percent Estimate!!VALUE!!Owner-occupied units!!$1,000,000 or more',
 'Percent Margin of Error!!VALUE!!Owner-occupied units!!$1,000,000 or more',
 'Estimate!!VALUE!!Owner-occupied units!!Median (dollars)',
 'Margin of Error!!VALUE!!Owner-occupied units!!Median (dollars)',
 'Percent Estimate!!VALUE!!Owner-occupied units!!Median (dollars)',
 'Percent Margin of Error!!VALUE!!Owner-occupied units!!Median (dollars)',
 'Estimate!!GROSS RENT!!Occupied units paying rent',
 'Margin of Error!!GROSS RENT!!Occupied units paying rent',
 'Percent Estimate!!GROSS RENT!!Occupied units paying rent',
 'Percent Margin of Error!!GROSS RENT!!Occupied units paying rent',
 'Estimate!!GROSS RENT!!Occupied units paying rent!!Less than $500',
 'Margin of Error!!GROSS RENT!!Occupied units paying rent!!Less than $500',
 'Percent Estimate!!GROSS RENT!!Occupied units paying rent!!Less than $500',
 'Percent Margin of Error!!GROSS RENT!!Occupied units paying rent!!Less than $500',
 'Estimate!!GROSS RENT!!Occupied units paying rent!!$500 to $999',
 'Margin of Error!!GROSS RENT!!Occupied units paying rent!!$500 to $999',
 'Percent Estimate!!GROSS RENT!!Occupied units paying rent!!$500 to $999',
 'Percent Margin of Error!!GROSS RENT!!Occupied units paying rent!!$500 to $999',
 'Estimate!!GROSS RENT!!Occupied units paying rent!!$1,000 to $1,499',
 'Margin of Error!!GROSS RENT!!Occupied units paying rent!!$1,000 to $1,499',
 'Percent Estimate!!GROSS RENT!!Occupied units paying rent!!$1,000 to $1,499',
 'Percent Margin of Error!!GROSS RENT!!Occupied units paying rent!!$1,000 to $1,499',
 'Estimate!!GROSS RENT!!Occupied units paying rent!!$1,500 to $1,999',
 'Margin of Error!!GROSS RENT!!Occupied units paying rent!!$1,500 to $1,999',
 'Percent Estimate!!GROSS RENT!!Occupied units paying rent!!$1,500 to $1,999',
 'Percent Margin of Error!!GROSS RENT!!Occupied units paying rent!!$1,500 to $1,999',
 'Estimate!!GROSS RENT!!Occupied units paying rent!!$2,000 to $2,499',
 'Margin of Error!!GROSS RENT!!Occupied units paying rent!!$2,000 to $2,499',
 'Percent Estimate!!GROSS RENT!!Occupied units paying rent!!$2,000 to $2,499',
 'Percent Margin of Error!!GROSS RENT!!Occupied units paying rent!!$2,000 to $2,499',
 'Estimate!!GROSS RENT!!Occupied units paying rent!!$2,500 to $2,999',
 'Margin of Error!!GROSS RENT!!Occupied units paying rent!!$2,500 to $2,999',
 'Percent Estimate!!GROSS RENT!!Occupied units paying rent!!$2,500 to $2,999',
 'Percent Margin of Error!!GROSS RENT!!Occupied units paying rent!!$2,500 to $2,999',
 'Estimate!!GROSS RENT!!Occupied units paying rent!!$3,000 or more',
 'Margin of Error!!GROSS RENT!!Occupied units paying rent!!$3,000 or more',
 'Percent Estimate!!GROSS RENT!!Occupied units paying rent!!$3,000 or more',
 'Percent Margin of Error!!GROSS RENT!!Occupied units paying rent!!$3,000 or more',
 'Estimate!!GROSS RENT!!Occupied units paying rent!!Median (dollars)',
 'Margin of Error!!GROSS RENT!!Occupied units paying rent!!Median (dollars)',
 'Percent Estimate!!GROSS RENT!!Occupied units paying rent!!Median (dollars)',
 'Percent Margin of Error!!GROSS RENT!!Occupied units paying rent!!Median (dollars)',
 'Estimate!!GROSS RENT!!Occupied units paying rent!!No rent paid',
 'Margin of Error!!GROSS RENT!!Occupied units paying rent!!No rent paid',
 'Percent Estimate!!GROSS RENT!!Occupied units paying rent!!No rent paid',
 'Percent Margin of Error!!GROSS RENT!!Occupied units paying rent!!No rent paid',
 'Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)',
 'Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)',
 'Percent Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)',
 'Percent Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)',
 'Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!Less than 15.0 percent',
 'Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!Less than 15.0 percent',
 'Percent Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!Less than 15.0 percent',
 'Percent Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!Less than 15.0 percent',
 'Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!15.0 to 19.9 percent',
 'Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!15.0 to 19.9 percent',
 'Percent Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!15.0 to 19.9 percent',
 'Percent Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!15.0 to 19.9 percent',
 'Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!20.0 to 24.9 percent',
 'Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!20.0 to 24.9 percent',
 'Percent Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!20.0 to 24.9 percent',
 'Percent Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!20.0 to 24.9 percent',
 'Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!25.0 to 29.9 percent',
 'Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!25.0 to 29.9 percent',
 'Percent Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!25.0 to 29.9 percent',
 'Percent Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!25.0 to 29.9 percent',
 'Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!30.0 to 34.9 percent',
 'Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!30.0 to 34.9 percent',
 'Percent Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!30.0 to 34.9 percent',
 'Percent Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!30.0 to 34.9 percent',
 'Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!35.0 percent or more',
 'Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!35.0 percent or more',
 'Percent Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!35.0 percent or more',
 'Percent Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!35.0 percent or more',
 'Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!Not computed',
 'Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!Not computed',
 'Percent Estimate!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!Not computed',
 'Percent Margin of Error!!GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME (GRAPI)!!Occupied units paying rent (excluding units where GRAPI cannot be computed)!!Not computed']

After going through the columns, we retrieve the two columns that we need, check for potential missing values, and remove unnecessary values.

In [11]:
rent =land[['Geographic Area Name','Estimate!!GROSS RENT!!Occupied units paying rent!!Median (dollars)','Estimate!!VALUE!!Owner-occupied units!!Median (dollars)']]

print('Total null:', rent.isnull().sum()[0])
print('Total NaN:',rent.isna().sum()[0])
print('Dataframe shape:', rent.shape)

rent['Geographic Area Name']=rent['Geographic Area Name'].str.replace(', San Francisco County, California', '',regex=True)

rent.rename(columns = {'Geographic Area Name': 'Tract','Estimate!!GROSS RENT!!Occupied units paying rent!!Median (dollars)': 'Estimated Median Gross Rent','Estimate!!VALUE!!Owner-occupied units!!Median (dollars)': 'Estimated Median Values (Owner-Occupied Units)'},inplace =True)

rent.head(2)
Total null: 0
Total NaN: 0
Dataframe shape: (197, 3)
C:\Users\garci\Anaconda3\lib\site-packages\ipykernel_launcher.py:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
C:\Users\garci\Anaconda3\lib\site-packages\pandas\core\frame.py:4223: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(**kwargs)
Out[11]:
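The SettingWithCopyWarning above appears because rent is created as a slice of land and then modified in place. A minimal sketch of one way to avoid it, assuming the same column names as above, is to take an explicit copy of the slice before replacing or renaming values:

#take an explicit copy so later assignments modify an independent dataframe, not a view of land
rent = land[['Geographic Area Name',
             'Estimate!!GROSS RENT!!Occupied units paying rent!!Median (dollars)',
             'Estimate!!VALUE!!Owner-occupied units!!Median (dollars)']].copy()

rent['Geographic Area Name'] = rent['Geographic Area Name'].str.replace(
    ', San Francisco County, California', '', regex=True)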

Now we retrieve only the rows that are in the main SF dataset and filter both datasets by the third requirement: median rent below $3,000.

In [12]:
rent = rent[rent['Tract'].isin(SF['Geo'])]
rent = rent[rent['Estimated Median Gross Rent'].astype(int) < 3000].reset_index(drop=True) 
SF= SF[SF['Geo'].isin(rent['Tract'])].reset_index(drop=True)
print('Shape after filtering:', rent.shape)
rent.head(2)
Shape after filtering: (16, 3)
Out[12]:

2.4 Tract Boundaries

Now, it's time to clean the last dataset.

In [13]:
neighborhoods = pd.read_json(r"C:\Users\garci\Desktop\Census 2010_ Tracts for San Francisco.geojson")

We start by creating a list of the selected SF tracts, which we will use to filter through the GeoJSON data. We also create another GeoJSON-style dictionary (the variable data) to keep the multipolygon coordinates (geographic boundaries) of the selected SF tracts.

In [14]:
filters = list(SF['Geo'].values)
print('The elements listed are the SF tracts to keep.\n\n', filters)
data = {'type': 'FeatureCollection','features':[]}
The elements listed are the SF tracts to keep.

['Census Tract 154', 'Census Tract 165', 'Census Tract 166', 'Census Tract 167', 'Census Tract 207', 'Census Tract 215', 'Census Tract 252', 'Census Tract 301.02', 'Census Tract 303.01', 'Census Tract 304', 'Census Tract 308', 'Census Tract 309', 'Census Tract 311', 'Census Tract 402', 'Census Tract 614', 'Census Tract 615']
In [15]:
for i in range(len(neighborhoods['features'])):
    if neighborhoods['features'][i]['properties']['namelsad10'] in  filters:
    #keep only the multipolygon coordinates for the selected SF Tracts
        data['features'].append(neighborhoods['features'][i]) 
print(len(data['features']), len(SF))
16 16

Next, let's retrieve the point coordinates for each selected SF tract.

In [16]:
n =pd.DataFrame(neighborhoods['features'])
#retrieve the properties column which contains the lat and long of the SF tracts
n =pd.DataFrame([n['features'][i]['properties'] for i in range(len(n['features']))]) 

n.sort_values(by=['namelsad10'],inplace=True) #to compare in the following loop
n.reset_index(inplace=True,drop=True)

for a, b in zip(n['namelsad10'],pop['Geo']):
    if a !=b: print('something went wrong',a,b) #making sure we correctly retrieved the SF tracts 

print(len(pop), len(n)) 
197 197
In [17]:
#retrieve only the selected SF tracts
n =n[n['namelsad10'].isin(SF['Geo'])].reset_index(drop=True) 

# compare against the main SF dataframe
for a,b in zip(n['namelsad10'],SF['Geo']):
    if a !=b: print('something went wrong',a,b)

#add the lat and long point coordinates from the n dataframe to the SF dataframe
for x in range(len(n)):
    #we skip the first character which always starts with a '+'
    SF.at[x,'lat'] = n['intptlat10'][x][1:] 
    SF.at[x, 'long'] = n['intptlon10'][x] 

print('Shape of main SF dataset:', SF.shape)
Shape of main SF dataset: (16, 24)
In [18]:
SF.head(3)
Out[18]:

The next step is merging the rent and main SF dataset together.

In [19]:
for a,b in zip(rent['Tract'], SF['Geo']):
    if a != b: print('Something is wrong:',a,b)#for sanity check

for x in range(len(SF)): #merge at index
    SF.at[x,'Estimated Median Gross Rent']= rent['Estimated Median Gross Rent'][x]
    SF.at[x,'Estimated Median Values (Owner-Occupied Units)'] = rent['Estimated Median Values (Owner-Occupied Units)'][x]

The last step in this cleaning process is to condense the data and convert the types of some columns.

In [20]:
SF.columns # copy and paste into next cell and cut down the columns that aren't needed
Out[20]:
Index(['Total; Estimate; Total population',
       'Total; Estimate; AGE - Under 5 years',
       'Total; Estimate; AGE - 5 to 17 years',
       'Total; Estimate; AGE - 18 to 24 years',
       'Total; Estimate; AGE - 25 to 44 years',
       'Total; Estimate; AGE - 45 to 54 years',
       'Total; Estimate; AGE - 55 to 64 years',
       'Total; Estimate; AGE - 65 to 74 years',
       'Total; Estimate; AGE - 75 years and over',
       'Total; Estimate; Median age (years)', 'Total; Estimate; SEX - Male',
       'Total; Estimate; SEX - Female',
       'Total; Estimate; RACE AND HISPANIC OR LATINO ORIGIN - One race',
       'Total; Estimate; RACE AND HISPANIC OR LATINO ORIGIN - One race - White',
       'Total; Estimate; RACE AND HISPANIC OR LATINO ORIGIN - One race - Black or African American',
       'Total; Estimate; RACE AND HISPANIC OR LATINO ORIGIN - One race - American Indian and Alaska Native',
       'Total; Estimate; RACE AND HISPANIC OR LATINO ORIGIN - One race - Asian',
       'Total; Estimate; RACE AND HISPANIC OR LATINO ORIGIN - One race - Native Hawaiian and Other Pacific Islander',
       'Total; Estimate; RACE AND HISPANIC OR LATINO ORIGIN - One race - Some other race',
       'Total; Estimate; Median income (dollars)', 'Geography', 'Geo', 'lat',
       'long', 'Estimated Median Gross Rent',
       'Estimated Median Values (Owner-Occupied Units)'],
      dtype='object')
In [21]:
#the columns to keep
SF=SF[['Geo','Total; Estimate; Total population','Estimated Median Gross Rent',  
        'Estimated Median Values (Owner-Occupied Units)', 'Total; Estimate; Median income (dollars)',
      'Total; Estimate; Median age (years)', 'Total; Estimate; SEX - Male',
       'Total; Estimate; SEX - Female', 'lat',   'long', ]]
In [22]:
#rename columns in place
SF.columns = 'Tract','Total Pop','Estimated Median Gross Rent', \
       'Estimated Median Values (Owner-Occupied Units)', \
       'Median Income', 'Median Age', 'Total Male %', 'Total Female %', 'lat','long' 
In [23]:
SF.head(2)
Out[23]:
In [24]:
#change the types of the following columns
SF_Summary=SF.astype({'Total Pop':'int64', 'Median Income':'int64','Total Male %':'float64','Total Female %':'float64',
                      'Median Age':'float64', 'Estimated Median Gross Rent': 'int64',
                      'Estimated Median Values (Owner-Occupied Units)': 'int64'})
In [25]:
SF_Summary.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 10 columns):
Tract                                             16 non-null object
Total Pop                                         16 non-null int64
Estimated Median Gross Rent                       16 non-null int64
Estimated Median Values (Owner-Occupied Units)    16 non-null int64
Median Income                                     16 non-null int64
Median Age                                        16 non-null float64
Total Male %                                      16 non-null float64
Total Female %                                    16 non-null float64
lat                                               16 non-null object
long                                              16 non-null object
dtypes: float64(3), int64(4), object(3)
memory usage: 1.4+ KB

The data is now clean and formatted!

2.5 Foursquare API Call

To retrieve the nearby venues using the Foursquare API, we'll first need to save our credentials into variables.

In [56]:
#CREDENTIALS (PERSONAL, SO REMOVED)
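#A sketch of what this cell contained, with placeholder values only -- the variable names below are the
#ones used by getNearbyVenues further down; the values shown are hypothetical, so substitute your own:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'          #placeholder
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'  #placeholder
VERSION = '20180605'                             #example API version date, not the author's actual value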

Next we define our parameters (click here for more info) and create a function to retrieve the nearby venues for each tract.

In [88]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

venues= getNearbyVenues(names=SF['Tract'], latitudes=SF['lat'],longitudes=SF['long'])
print('Venues successfully retrieved!')
Venues successfully retrieved!

After retrieving the venues, we now need to clean the venues dataframe.

In [27]:
print(venues.shape)
venues.head(2)
(764, 8)
Out[27]:
In [28]:
#build (longitude, latitude, venue name) tuples so that duplicates (same coordinates and same venue name) can be identified
dr = [(a,b,c) for a,b,c in zip(venues['Venue Longitude'], venues['Venue Latitude'], venues['Venue'])]
print(len(dr),len(venues))
print('Example of an element in dr:',dr[0])
764 764
Example of an element in dr: (-122.45093717321892, 37.7865165121848, "Bryan's Market")
In [29]:
#Create a dictionary that counts the number of times that a tuple appears 
count_map={}
for i in dr: count_map[i] =count_map.get(i,0)+1
print(len(count_map),"\n", "Example:",list(count_map)[0:2])

keys =list(count_map.keys()) 
711
Example: [(-122.45093717321892, 37.7865165121848, "Bryan's Market"), (-122.44903868903226, 37.78278387882597, 'Tofino Wines')]

We now need to match the venues with their corresponding tracts. Note that each venue's coordinates must lie within the boundary of a selected SF tract; any venue that doesn't is filtered out. This reduces redundancy in the data and will also be useful for the analysis in the next section.

In [30]:
new= [] 
for x in range(len(data['features'])):
    poly = data['features'][x]['geometry']['coordinates']
    p =path.Path(poly[0][0]) #the multipolygon (two dimensional)
    for i in range(len(keys)):
#grab the first two items as lists of each tuple to assess if it's within the select polygon 
        if p.contains_point(list(keys[i][0:2])) ==True: 
            tract =data['features'][x]['properties']['namelsad10'] 
            new.append((tract, keys[i]))#add tract name with corresponding venue name and coordinates of venues(lat and long)
len(new)  
Out[30]:
454
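As an aside, matplotlib's Path also provides a vectorized contains_points method, which tests every venue against a tract boundary in a single call. A minimal sketch, assuming the data and keys variables defined above:

import numpy as np
from matplotlib import path

points = np.array([k[:2] for k in keys])  #(longitude, latitude) pair for every unique venue

new_vectorized = []
for feature in data['features']:
    p = path.Path(feature['geometry']['coordinates'][0][0])  #outer ring of the tract multipolygon
    mask = p.contains_points(points)                          #boolean array, one entry per venue
    tract = feature['properties']['namelsad10']
    new_vectorized.extend((tract, keys[i]) for i in np.flatnonzero(mask))

This should produce the same (tract, venue) pairs as the loop above while avoiding the per-point Python call.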

Next, we create a new dataframe from the filtered list we just built.

In [31]:
for i in range(len(new)):   
    if i==0: VENUES = venues[(venues['Neighborhood']==new[0][0]) & (venues['Venue'] ==new[0][1][2])]
    else: VENUES = VENUES.append(venues[(venues['Neighborhood']==new[i][0]) & (venues['Venue'] ==new[i][1][2])], ignore_index=True)
In [32]:
#drop any leftover duplicate value(s)
i = list(VENUES[VENUES.duplicated()].index)
VENUES.drop(VENUES.index[i],inplace=True)
print(len(VENUES))
454
In [33]:
VENUES.drop(columns='Unnamed: 0',inplace=True)
VENUES.reset_index(drop=True)
VENUES.head(2)
Out[33]:

Lastly, we print out how many rows were removed from the venues dataset.

In [34]:
print('There are {} unique categories.'.format(len(VENUES['Venue Category'].unique())))

print('Percentage of rows kept: {:.2%}'.format(len(VENUES)/len(dr)))
print('Dataframe rows before preprocessing: {} \n Dataframe rows after preprocessing: {}'.format(len(dr),VENUES.shape[0]))
There are 155 unique categories.
Percentage of rows kept: 59.42%
Dataframe rows before preprocessing: 764 
 Dataframe rows after preprocessing: 454

We did a lot here. We started by cleaning the main SF dataset and filtering out the tracts that didn't meet the population and income requirements, which left us with 18 of the original 197 SF tracts. We then converted the schools' and colleges' GeoJSON files into dataframes and took care of the duplicate values they contained. From there, we retrieved two select columns from the large housing dataset and kept the tracts that met the rent requirement, leaving us with 16 tracts. As for the SF geographic boundary dataset, we created two datasets: one with the point coordinates and the other with the boundary coordinates for each of the selected tracts. Lastly, we retrieved and cleaned the Foursquare data, which consists of nearby venues around the selected SF tracts, making sure that each venue lies within the boundary of one of the selected tracts.

3. Data Analyses

3.1 Initial Neighborhood Visualization

For the first visualization, we want to show the locations of the selected SF tracts. First, we create two functions that will allow us to create a choropleth map and either map markers or cluster markers.

In [35]:
def choropleth_map(map_name,geo_data,DF,layer_name, columns,color,fill,line,legend_name):
    return map_name.choropleth(geo_data=geo_data,
           data=DF,  name=layer_name, columns=columns, key_on='feature.properties.namelsad10',
           fill_color=color, fill_opacity=fill, line_opacity=line,legend_name= legend_name )
In [36]:
def f(data,iconName,firstColumn,info, tooltip,color,icon,mc =False):
    feature_group = folium.FeatureGroup(name =iconName)
    if mc ==True:
        mc =MarkerCluster(name =iconName)
        for note, lat,long in zip(data[firstColumn],data['lat'],data['long']):
            mc.add_child(folium.Marker([lat,long],popup=note, tooltip= 'Click for More Info', icon=folium.Icon(color =color,icon=icon,prefix ='fa')))
        return mc
    else: 
        for note, lat,long,tool in zip(data[firstColumn],data['lat'],data['long'],data[tooltip]): #iterate over each location's name, coordinates, and tooltip value
            folium.Marker([lat,long],popup='{} {}'.format(info,tool), tooltip= note, icon=folium.Icon(color =color,icon=icon,prefix ='fa')).add_to(feature_group)
        return feature_group

Now let's create a choropleth map with the total population as a layer, and add popups showing the median income for each respective tract. We do this with the folium library in Python, which allows for the creation of interactive Leaflet maps from coordinate data.

In [74]:
sf_map = folium.Map(location=[37.7749, -122.4194], zoom_start=12)
choropleth = choropleth_map(sf_map,data,SF_Summary,'Population Layer',['Tract','Total Pop'],'YlOrRd',.75,.45,'Population Per Neighborhood')

feature_group = f(SF_Summary, 'Neighborhood', 'Tract','Median Income:','Median Income','darkblue','map-marker')
feature_group.add_to(sf_map)

folium.LayerControl(collapsed=True).add_to(sf_map)
sf_map
Out[74]:

Looking at the map, you can see that most of the tracts that meet the population, income, and rent requirements form clusters roughly located around the center and upper east side of SF.

3.2 Venue Analyses

To get a better understanding of the venues data, let's group the data by neighborhood to visualize the nearby venues count for each tract.

In [65]:
plot1 =VENUES.groupby('Neighborhood')[['Neighborhood','Venue']].count().sort_values(by='Venue')
ax = plot1.plot(kind='barh',figsize=(10,9),color ='#86bf91',zorder=2,width=0.8,alpha=.7,fontsize=12)


ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

ax.set_xlabel('Nearby Venues Per Neighborhood',labelpad=20,weight='bold',size=14)
ax.set_ylabel('Neighborhood Tract',labelpad=20,weight='bold',size=15)

vals=ax.get_xticks()
ax.get_legend().remove()
for tick in vals: ax.axvline(x=tick,linestyle='dashed',alpha=0.6,color='#eeeeee',zorder=1)

ax.tick_params(top=False, bottom=False, left=False,right=False, labelleft=True, labelbottom=False)
for i in range(len(plot1)):
    ax.text(ax.patches[i].get_width()+.2,ax.patches[i].get_y()+.2,str(ax.patches[i].get_width()) \
          , fontsize=12,color='grey')
plt.savefig('first1.png',bbox_inches='tight')
[Notebook image: horizontal bar chart of nearby venue counts per neighborhood tract]

As the chart shows, only a few of the neighborhoods that matched the company's population, income, and rent requirements had close to 81 nearby venues, with roughly half of the tracts having a nearby venue count below 19.

For the next graph, given the large number of unique venue categories (155 in total), let's visualize the 10 most common venue categories across the census tracts.

In [63]:
plot2 = VENUES.groupby('Venue Category')[['Venue']].count().sort_values(by='Venue',ascending=False).head(10)

norm = plt.Normalize(0,plot2['Venue'].values.max())
colors=plt.cm.Blues(norm(plot2['Venue'].values))
ax2 = plot2['Venue'].plot(kind='bar',figsize=(15,9),zorder=2,width=0.6,rot=45,color=colors,fontsize=14)
ax2.set_xlabel('')
ax2.set_title('10 Most Frequently Occurring Venue Types in Select SF Neighborhoods',fontsize=16)

ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.spines['left'].set_visible(False)
ax2.tick_params(top=False, bottom=False, left=False,right=False, labelleft=False, labelbottom=True)

for p in ax2.patches:
    b=p.get_bbox()
    val = '{}'.format(int(b.y1))
    ax2.annotate(val,((b.x0+b.x1-.3)/2, b.y1+1),fontsize=14)
plt.savefig('2nd.png',bbox_inches='tight')
[Notebook image: bar chart of the 10 most common venue categories]

As the figure shows, coffee shops (in combination with the café venue category) were by far the most frequently occurring venues. Parks were also quite common. The number of parks and trails was not too surprising, given that around three of the tracts are near or in more rural areas (see the map above). The number of coffee shops, however, did come as a surprise; it seems the residents in these tracts love their cup of coffee.

3.3 Venue Category - One-hot Encoding

In order to segment the tracts according to their venue categories, we create another dataframe by applying one-hot encoding to the venue categories.

In [40]:
# Let's create dummy columns (with values of 0 for absent)
venue_onehot = pd.get_dummies(VENUES[['Venue Category']], prefix="", prefix_sep="")

# Add the neighborhood column to the venue_onehot dataframe
venue_onehot['Neighborhood'] = VENUES['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns= [venue_onehot.columns[int(venue_onehot.columns.get_loc('Neighborhood'))]]+ \
                   [venue_onehot.columns[i] for i in range(len(venue_onehot.columns)) \
                if i != int(venue_onehot.columns.get_loc('Neighborhood'))]
venue_onehot = venue_onehot[fixed_columns]

#Lastly, create a dataframe grouped by neighborhood with the mean of each venue category per SF tract
sf_grouped = venue_onehot.groupby('Neighborhood').mean().reset_index()
sf_grouped.head(3)
Out[40]:

Next we get the top 10 most common venues for each SF tract. Note that the code above and below was adapted from the IBM Capstone example.

In [41]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = sf_grouped['Neighborhood']

for ind in np.arange(sf_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sf_grouped.iloc[ind, :], num_top_venues)


neighborhoods_venues_sorted.head()
Out[41]:

To address the competition requirement, let's segment the tracts. This will allow us to select the cluster that contains roughly the same types of venues as the company (food-related venues). From there, we can select the tracts with the least competition for the company, which in this case means Mexican food venues.

3.4 K-Means Clustering

Let's start by segmenting the tracts based on their venue categories, using the K-Means clustering algorithm from the scikit-learn library. Since there are only 16 tracts, I chose to create 2 clusters.

In [42]:
kclusters= 2
temp = sf_grouped.drop('Neighborhood',1) #drop the Neighborhood column so only the venue categories are clustered
#run KMeans with several centroid seeds (n_init) and keep the best run
kmeans= KMeans(n_init = 50,n_clusters=kclusters, random_state=0).fit(temp)

print(len(kmeans.labels_))
print(kmeans.labels_)

#insert the resulting cluster labels into the dataframe containing the top 10 most common venues
neighborhoods_venues_sorted.insert(0,'ClusterLabels',kmeans.labels_)
16
[0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0]
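The choice of k here is a judgment call. As a quick sanity check, one could compute the within-cluster sum of squares (inertia) over a range of k values and look for an elbow; a minimal sketch, assuming sf_grouped is already built:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = sf_grouped.drop('Neighborhood', axis=1)
inertias = [KMeans(n_clusters=k, n_init=50, random_state=0).fit(X).inertia_ for k in range(1, 9)]

plt.plot(range(1, 9), inertias, marker='o')  #look for the point where the curve flattens out
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow check for choosing k')
plt.show()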

Let's visualize the resulting clusters. To do that, we first split the dataframe of most-common venues by cluster label. We then grab each cluster's index positions (the index was not reset) and use them to attach the corresponding coordinates from SF_Summary to each cluster.

In [44]:
CL0 =neighborhoods_venues_sorted[neighborhoods_venues_sorted['ClusterLabels'] == 0]
CL1 =neighborhoods_venues_sorted[neighborhoods_venues_sorted['ClusterLabels'] == 1]

i1 = CL0.index.to_list()
i2 = CL1.index.to_list()

for i in i1: CL0.at[i, 'lat']= SF_Summary[SF_Summary['Tract'].isin(CL0['Neighborhood'])]['lat'][i]
for i in i1: CL0.at[i, 'long']= SF_Summary[SF_Summary['Tract'].isin(CL0['Neighborhood'])]['long'][i]

for i in i2: CL1.at[i, 'lat']= SF_Summary[SF_Summary['Tract'].isin(CL1['Neighborhood'])]['lat'][i]
for i in i2: CL1.at[i, 'long']= SF_Summary[SF_Summary['Tract'].isin(CL1['Neighborhood'])]['long'][i]
  
C:\Users\garci\Anaconda3\lib\site-packages\pandas\core\indexing.py:376: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
C:\Users\garci\Anaconda3\lib\site-packages\pandas\core\indexing.py:494: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
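The warnings above come from assigning coordinates into slices of neighborhoods_venues_sorted. A warning-free alternative, assuming the same dataframes as above, is to merge the coordinates in once and then split by cluster label:

#attach the tract coordinates with a single merge, then split by cluster label
merged = neighborhoods_venues_sorted.merge(
    SF_Summary[['Tract', 'lat', 'long']],
    left_on='Neighborhood', right_on='Tract', how='left').drop(columns='Tract')

CL0 = merged[merged['ClusterLabels'] == 0].copy()
CL1 = merged[merged['ClusterLabels'] == 1].copy()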

Now let's map the clusters!

In [45]:
def circleMarkers(name, lat, lon, tract, label, color, fill): 
    fg = folium.FeatureGroup(name =name)
    for lat, lon, tract, cluster in zip(lat, lon, tract, label):
        label = folium.Popup(str(tract) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker( [lat, lon], radius=9,popup=label, color=color,
        fill=True,  fill_color=fill, fill_opacity=0.6).add_to(fg)
    return fg
In [75]:
map_clusters = folium.Map(location=[37.7749, -122.4194], zoom_start=12)
FG0=circleMarkers('Cluster 0', CL0['lat'],CL0['long'],CL0['Neighborhood'],CL0['ClusterLabels'], 'blue', 'blue')
FG0.add_to(map_clusters)

FG1= circleMarkers('Cluster 1',CL1['lat'],CL1['long'],CL1['Neighborhood']  ,CL1['ClusterLabels'], 'red', 'red')
FG1.add_to(map_clusters)

folium.LayerControl().add_to(map_clusters)
map_clusters
Out[75]:

It's interesting to see how the tracts are scattered around the mid-center of SF. In addition, most of the tracts in cluster 1 are closer to the busiest parts of SF, located in the upper east side of the city.

Now let's see the two clusters that were created.

In [47]:
neighborhoods_venues_sorted.loc[neighborhoods_venues_sorted['ClusterLabels'] == 0, neighborhoods_venues_sorted.columns[[1] + list(range(2, neighborhoods_venues_sorted.shape[1]))]]
Out[47]:
In [48]:
neighborhoods_venues_sorted.loc[neighborhoods_venues_sorted['ClusterLabels'] == 1, neighborhoods_venues_sorted.columns[[1] + list(range(2, neighborhoods_venues_sorted.shape[1]))]]
Out[48]:

Looking at the results, cluster 1 seems to contain outdoor venues (parks, trails, and mountains) among its top 5 most common venue categories, followed by dessert-related venues (donut shops and dumpling restaurants). Cluster 0, on the other hand, is dominated by coffee shops and restaurants as its most common venues. There are, however, some tracts in cluster 0 with a few outdoor venues; this makes sense, as those tracts are very near the tracts in cluster 1 that, as stated previously, are characterized by their outdoor venues. I have chosen cluster 0 as the cluster of focus, as most of its top venues are similar to the company's business type (food service).

However, we now need to filter out the tracts in cluster 0 that contain at least one Mexican-food-related venue among their most common venues: tracts 166, 167, 252, and 308. Since the company requires the least competition, it seems fair to drop these tracts.

In [49]:
CL0=CL0[~CL0['Neighborhood'].isin(['Census Tract 166','Census Tract 167','Census Tract 252', 'Census Tract 308'])]

3.5 Visualizing and Examining the Clusters

To address the remaining requirement -- proximity to at least one college and three other types of schools -- let's visualize the remaining tracts along with the schools and colleges on a folium map.

In [76]:
map_clusters2 = folium.Map(location=[37.7749, -122.4194], zoom_start=12)
FG0=circleMarkers('Cluster 0', CL0['lat'],CL0['long'],CL0['Neighborhood'],CL0['ClusterLabels'],'blue','blue')
FG0.add_to(map_clusters2)

feature_group2 = f(schools,'Schools','School Name','', 'School Name','lightblue','book',mc =True)
feature_group2.add_to(map_clusters2)

feature_group3 = f(colleges,'Colleges','School Name','','School Name','blue','university')
feature_group3.add_to(map_clusters2)

folium.LayerControl().add_to(map_clusters2)
map_clusters2
Out[76]:

After a quick analysis, we can observe that tracts 402 and 311 don't meet this school-proximity requirement.

In [84]:
CL0=CL0[~CL0['Neighborhood'].isin(['Census Tract 402', 'Census Tract 311'])]
In [85]:
SF =SF_Summary[SF_Summary['Tract'].isin(CL0['Neighborhood'])].sort_values('Median Income',ascending=False)
In [86]:
SF.head()
Out[86]:
In [87]:
SF.describe(include='all').round(2)
Out[87]:

4. Conclusions

4.1 Results and Discussion

Our analysis shows that the selected tracts' venue categories are characterized by coffee shops, outdoor venues such as parks, and Asian restaurants. This makes sense, since most of the tracts these venues belong to are located near several colleges and are centered around parks or trails. We can also assume that these neighborhoods have a relatively large Asian population, given the popularity of the Asian restaurants observed. In regards to their spatial location in SF, the observed tracts seem to be scattered around the mid-center of the city.

After digging further into the selected tracts, we found seven tract/neighborhood candidates for the business to choose from. These tracts were found to be affluent, with a mean median income of 77,701 dollars and an average gross rent of 2,229 dollars. In particular, tract 615 stood out, with a population size nearly double that of the others (13,864), a relatively high median income ($103,451), and proximity to several colleges. Special consideration should be given to this tract.

It should be noted that these results only provide a direction in which the company can conduct further, more detailed analysis of the suggested neighborhood locations. Given the methods used to answer the business question, I can only conclude that the candidate tracts are potentially good locations for the company to look further into, based on the requirements outlined in the introduction of this report. Despite the limitations mentioned, this report is still useful for the company, as it condenses the number of locations they have to analyze.

4.2 Conclusion and Final Remarks

Restaraunte La Tierra De Tacos had trouble finding a good location within SF to start their business, given the size of SF and its large number of neighborhoods. In order to find a location that aligns with their business strategy, they came up with five requirements that a neighborhood needed to meet: a median income of $50,000 or more, a total population of at least 5,000 people, a median rent below $3,000, nearby venues that are similar in business type but with the least amount of competition, and lastly, proximity to at least one college and three other types of schools. With the requirements in place, I retrieved relevant data from the U.S. Census Bureau, Data SF, and the Foursquare API with the aim of finding the tracts that met the company's requirements. Several cleaning steps were needed after retrieving the data. After filtering, the original 197 tracts in SF were first condensed to 18 that met the population and income requirements, and then to 16 after applying the rent requirement.

To filter by the last two requirements, I used the K-Means clustering technique and selected cluster 0 as the best representative of the company's business type. To fulfill the competition requirement, I looked through the dataframe and filtered out four tracts that contained Mexican restaurants among their most common venue categories. The remaining tracts were then mapped along with the schools and colleges in order to select the ones that met the school-proximity requirement. The seven tracts that met this last requirement were then presented to the company as potential tracts for the business to look further into. This report proved beneficial, as the company was able to conduct detailed analyses of the resulting seven tracts and, ultimately, find the best tract in which to settle their business.

In [ ]:
jovian.commit()