Annual Cost of Living Monte Carlo Models
Monte Carlo Model to determine the cost of living in your city.
python, numerical-computing, monte-carlo, statistics
7 Minutes, 14 Seconds
2022-06-01 15:24 +0000
Cost of Living Projections
Introduction
I do not like negotiating salary, especially without valid projections to determine a range.
I prepared this report to estimate a salary expectation that will maintain my current standard of living.
I present two Monte Carlo models of annual living costs in Houston and NYC. The data is somewhat dated and, particularly in the case of Houston, consists of high-level estimates.
In order to produce a better report, I am currently scraping data from the internet for more accurate sample distributions. I will be able to present that soon.
With that said, the model should not deviate by more than about 5-10 percent from what is presented below.
Findings
An annual salary of $90,000 would be sufficient to qualify for rent in Houston and most likely in the median-income neighborhoods of NYC.
I arrived at this number by computing a confidence interval for annual rent costs in both cities, assuming a normal distribution. I then multiplied that number by 3 to meet the lease qualifications of most landlords.
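As a minimal sketch of that arithmetic, assuming a landlord requires annual income of at least three times annual rent. The function name and the example rent figure are illustrative and do not come from the notebook:

```python
def required_salary(annual_rent: float, income_multiple: float = 3.0) -> float:
    """Annual income needed to qualify for a lease at a given annual rent."""
    return annual_rent * income_multiple


# Illustrative figure: 2500 per month is 30000 per year.
annual_rent = 2500 * 12
print(required_salary(annual_rent))  # 90000.0
```

With an annual rent of roughly 30,000, the three-times rule lands on the 90,000 figure quoted above.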
Limitations of the Model
Old NYC Data
The NYC data I am using is from 2018. I will be updating it soon.
Houston Data
The Houston estimate is based on the cost of staying in the property I currently live in. The rent is 2400 a month. I estimated that it could rise to at most about 2600 in the next year. If I were to move, similar housing goes for around 2200 to 2600 a month. I used these as the bounds of my estimates.
Houston Cost of Living Expenses
I intend to stay in Houston for the next year. I would like to move to NY eventually to be nearer to a central office, but not in the near future.
lower_bound = int(2400)
upper_bound = int(2600)
median = 2500
standard_dev = 100
cap_range = range(lower_bound, upper_bound)
rent_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)
rent_sample = choice(rent_distribution,12)
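The block above defines `lower_bound` and `upper_bound` but does not use them when sampling, so draws from the normal distribution can fall outside the 2400 to 2600 range. With a standard deviation of 100 and bounds one standard deviation from the center, roughly a third of draws do. One possible way to enforce the bounds, shown only as a sketch, is SciPy's truncated normal. The variable name `bounded_rent_distribution` and the fixed `random_state` are assumptions made for this example:

```python
import numpy as np
from scipy.stats import truncnorm

lower_bound, upper_bound = 2400, 2600
median, standard_dev = 2500, 100

# truncnorm expects its bounds in units of standard deviations from loc.
a = (lower_bound - median) / standard_dev
b = (upper_bound - median) / standard_dev

# 10,000 monthly rents that are guaranteed to stay within the bounds.
bounded_rent_distribution = truncnorm.rvs(
    a, b, loc=median, scale=standard_dev, size=10000, random_state=42
)
print(bounded_rent_distribution.min(), bounded_rent_distribution.max())
```

Truncating narrows the spread of the samples, so the resulting annual figures would differ somewhat from the ones reported in this post.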
Houston Monthly food costs
lower_bound = int(300)
upper_bound = int(500)
median = 400
standard_dev = 50
food_range = range(lower_bound, upper_bound)
food_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)
food_sample = choice(food_distribution, 12)
Houston Insurance Costs
lower_bound = int(200)
upper_bound = int(300)
median = 250
standard_dev = 25
insurance_range = range(lower_bound, upper_bound)
insurance_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)
The Houston Cost of Living DF
cost_of_living_df = pd.DataFrame()
cost_of_living_df['rent']= choice(rent_distribution,12)
cost_of_living_df['food'] = choice(food_distribution, 12)
cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
cost_of_living_df
Houston Costs Per Annum Algorithm
The algorithm below calculates the annual cost of rent, food, and insurance to determine the total cost per year. Each month's rent, food, and insurance cost is chosen at random from the distributions defined above.
I run the simulation 10,000 times, which corresponds to 10,000 random samples of annual costs. The point of doing this is to build an approximately normal distribution from which to define confidence intervals for my total annual costs.
years = 10000
cycle_price_samples = np.zeros(shape=years)
cycle_rent_samples = np.zeros(shape=years)
cycle_food_samples = np.zeros(shape=years)
cycle_insurance_samples = np.zeros(shape=years)
for year in range(years):
    # Build a fresh DataFrame for each simulated year
    cost_of_living_df = pd.DataFrame()
    # Randomly choose 12 monthly rents
    cost_of_living_df['rent'] = choice(rent_distribution, 12)
    # Randomly choose 12 monthly food costs
    cost_of_living_df['food'] = choice(food_distribution, 12)
    # Randomly choose 12 monthly insurance costs
    cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
    # Total cost for each month
    cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
    # Sum the 12 months to get the annual figures
    annual_cost = cost_of_living_df['monthly_cost'].sum()
    annual_rent = cost_of_living_df.rent.sum()
    annual_food = cost_of_living_df.food.sum()
    annual_insurance = cost_of_living_df.insurance.sum()
    # Record this simulated year's results
    cycle_price_samples[year] = annual_cost
    cycle_food_samples[year] = annual_food
    cycle_insurance_samples[year] = annual_insurance
    cycle_rent_samples[year] = annual_rent
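The loop above builds a DataFrame for every simulated year. The same idea can be sketched without the loop by drawing a years-by-months array in one call. This is an alternative and not the method used in this post: it samples directly from the normal distributions with NumPy's `Generator` API instead of resampling from the pre-drawn pools of 10,000 values, and the names `rng`, `params`, and `annual` are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability
years, months = 10_000, 12

# (mean, standard deviation) of each monthly cost, from the Houston sections.
params = {"rent": (2500, 100), "food": (400, 50), "insurance": (250, 25)}

# One row per simulated year, one column per month; sum across the months.
annual = {
    name: rng.normal(loc=mu, scale=sigma, size=(years, months)).sum(axis=1)
    for name, (mu, sigma) in params.items()
}
annual["total"] = annual["rent"] + annual["food"] + annual["insurance"]
print(annual["total"].mean())
```

Each entry of `annual` is an array of 10,000 annual sums, which could be placed into a DataFrame in the same way as the `cycle_*_samples` arrays.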
Houston Prediction Df
prediction_df = pd.DataFrame()
prediction_df['rent'] = cycle_rent_samples
prediction_df['food'] = cycle_food_samples
prediction_df['insurance'] = cycle_insurance_samples
prediction_df['total'] = cycle_price_samples
prediction_df.describe()
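The summary statistics can be cross-checked by hand. If the monthly costs are treated as independent normal draws, the sum of 12 months has a mean of 12 times the monthly mean and a variance of 12 times the monthly variance. The sketch below works through that arithmetic for the Houston inputs. It assumes independence, and the simulation resamples from finite pools, so the simulated values will differ slightly:

```python
import math

# (mean, standard deviation) of each monthly cost, from the Houston sections.
monthly = {"rent": (2500, 100), "food": (400, 50), "insurance": (250, 25)}
months = 12

# Means add; variances add for independent draws.
expected_total = months * sum(mu for mu, _ in monthly.values())
std_total = math.sqrt(months * sum(sigma**2 for _, sigma in monthly.values()))

print(expected_total)       # 37800
print(round(std_total, 1))  # 396.9
```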
Houston Annual Cost Histogram
prediction_df.total.plot.hist(grid=True, bins=20, rwidth=0.9,
color='#607c8e')
plt.xlabel('Annual Total Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
Houston: Calculating the Confidence Interval For Total Costs
The data is nearly normal. Larger sample sizes would produce a graph closer to perfect normality.
st.norm.interval(0.90, loc=np.mean(prediction_df.total), scale=st.sem(prediction_df.total))
(37795.2942543157, 37808.287836034055)
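Because the interval above is scaled by `st.sem`, the standard error of the mean, it describes the uncertainty in the average simulated year and not the spread of individual years. If the goal is a range that most simulated years fall into, percentiles of the simulated totals may be closer to what is wanted. The sketch below contrasts the two. The `totals` array is a synthetic stand-in for `prediction_df.total`, generated with assumed parameters:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(seed=1)
# Synthetic stand-in for prediction_df.total: 10,000 simulated annual totals.
totals = rng.normal(loc=37800, scale=397, size=10_000)

# Interval for the mean: it narrows as the number of simulations grows.
mean_ci = st.norm.interval(0.90, loc=totals.mean(), scale=st.sem(totals))

# Range containing the middle 90% of the simulated years.
year_range = np.percentile(totals, [5, 95])

print(mean_ci)
print(year_range)
```

The first interval is only a few dollars wide, while the second spans more than a thousand dollars, which is the more conservative basis for a budget.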
Houston Annual Rent Histogram
# Annual rent histogram
prediction_df.rent.plot.hist(grid=True, bins=20, rwidth=0.9,
color='#607c8e')
plt.title('Annual Rent Cost Distribution ')
plt.xlabel('Annual Rent Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
Houston: Calculating the Confidence Interval For Annual Rent
The data is nearly normal. Larger sample sizes would produce a graph closer to perfect normality.
st.norm.interval(0.95, loc=np.mean(prediction_df.rent), scale=st.sem(prediction_df.rent))
(29996.264715447538, 30009.767827637417)
New York Cost of Living Expenses
For the sake of comparison, the New York Expense distributions are calculated below. I assume that everything but rent will be equivalent to Houston. A more accurate model would account for insurance, food, and incidental differences.
I am assuming the rent of a two bedroom apartment.
The data I am using was scraped from Craigslist in 2018. I will redo the analysis later with 2022 data to get a better model.
nyc_df = pd.read_csv("/Users/jnapolitano/Projects/cost-of-living-projections/nyc-housing.csv", encoding="unicode-escape")
# Assuming a two-bedroom apartment
nyc_df = nyc_df[nyc_df['Bedrooms']== '2br']
nyc_df.describe()
The price is about 2800 with a standard deviation of 7,465, which is absurd. To do a better analysis, I need to clean the data.
idx = (nyc_df.Price > 500) & (nyc_df.Price < 4500)
nyc_df = nyc_df[idx]
nyc_df.describe()
With the outliers removed, the data is far more manageable. I'm surprised by the mean price. Again, this data is old, and it also does not account for neighborhoods. I will redo the analysis at a later date, filtered by neighborhood.
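The price cutoffs used for the filter were picked by hand. A common alternative is an interquartile-range rule, sketched below on made-up listing prices. The `listings` DataFrame and its values are hypothetical and stand in for the real `nyc_df`, which is not reproduced here:

```python
import pandas as pd

# Hypothetical listing prices, including one obviously bad entry.
listings = pd.DataFrame(
    {"Price": [1800, 2100, 2400, 2500, 2700, 3000, 3300, 95000]}
)

# Keep prices within 1.5 * IQR of the first and third quartiles.
q1, q3 = listings["Price"].quantile([0.25, 0.75])
iqr = q3 - q1
keep = listings["Price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = listings[keep]

print(cleaned["Price"].max())
```

An IQR rule adapts to the data instead of relying on fixed thresholds, though it can still discard legitimate high-end listings.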
Creating the NYC Distributions
lower_bound = int(600)
upper_bound = int(4500)
median = 2435
standard_dev = 729
cap_range = range(lower_bound, upper_bound)
rent_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)
rent_sample = choice(rent_distribution,12)
NYC Monthly food costs
lower_bound = int(300)
upper_bound = int(500)
median = 400
standard_dev = 50
food_range = range(lower_bound, upper_bound)
food_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)
food_sample = choice(food_distribution, 12)
NYC Insurance Costs
lower_bound = int(200)
upper_bound = int(300)
median = 250
standard_dev = 25
insurance_range = range(lower_bound, upper_bound)
insurance_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)
NYC Cost of Living Distribution
cost_of_living_df = pd.DataFrame()
cost_of_living_df['rent']= choice(rent_distribution,12)
cost_of_living_df['food'] = choice(food_distribution, 12)
cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
cost_of_living_df
NYC Costs Per Annum Algorithm
The algorithm below calculates the annual cost of rent, food, and insurance to determine the total cost per year. Each month's rent, food, and insurance cost is chosen at random from the distributions defined above.
I run the simulation 10,000 times, which corresponds to 10,000 random samples of annual costs. The point of doing this is to build an approximately normal distribution from which to define confidence intervals for my total annual costs.
years = 10000
cycle_price_samples = np.zeros(shape=years)
cycle_rent_samples = np.zeros(shape=years)
cycle_food_samples = np.zeros(shape=years)
cycle_insurance_samples = np.zeros(shape=years)
for year in range(years):
    # Build a fresh DataFrame for each simulated year
    cost_of_living_df = pd.DataFrame()
    # Randomly choose 12 monthly rents
    cost_of_living_df['rent'] = choice(rent_distribution, 12)
    # Randomly choose 12 monthly food costs
    cost_of_living_df['food'] = choice(food_distribution, 12)
    # Randomly choose 12 monthly insurance costs
    cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
    # Total cost for each month
    cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
    # Sum the 12 months to get the annual figures
    annual_cost = cost_of_living_df['monthly_cost'].sum()
    annual_rent = cost_of_living_df.rent.sum()
    annual_food = cost_of_living_df.food.sum()
    annual_insurance = cost_of_living_df.insurance.sum()
    # Record this simulated year's results
    cycle_price_samples[year] = annual_cost
    cycle_food_samples[year] = annual_food
    cycle_insurance_samples[year] = annual_insurance
    cycle_rent_samples[year] = annual_rent
NYC Prediction Df
prediction_df = pd.DataFrame()
prediction_df['rent'] = cycle_rent_samples
prediction_df['food'] = cycle_food_samples
prediction_df['insurance'] = cycle_insurance_samples
prediction_df['total'] = cycle_price_samples
prediction_df.describe()
NYC Annual Cost Histogram
prediction_df.total.plot.hist(grid=True, bins=20, rwidth=0.9,
color='#607c8e')
plt.xlabel('Annual Total Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
NYC: Calculating the Confidence Interval For Total Costs
The data is nearly normal. Larger sample sizes would produce a graph closer to perfect normality.
st.norm.interval(0.90, loc=np.mean(prediction_df.total), scale=st.sem(prediction_df.total))
(36979.727235126586, 37063.36039733022)
NYC Annual Rent Histogram
# Annual rent histogram
prediction_df.rent.plot.hist(grid=True, bins=20, rwidth=0.9,
color='#607c8e')
plt.title('Annual Rent Cost Distribution ')
plt.xlabel('Annual Rent Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
NYC: Calculating the Confidence Interval For Annual Rent
The data is nearly normal. Larger sample sizes would produce a graph closer to perfect normality.
st.norm.interval(0.95, loc=np.mean(prediction_df.rent), scale=st.sem(prediction_df.rent))
(29169.877514702926, 29269.14186706609)
NYC Closing Remarks
The rent distribution in NYC, based on 2018 data, is nearly comparable to my Houston estimate. An annual salary of 90,000 would permit me to live at about the median level in the city. I will be redoing this report soon, as the data is old. I am currently scraping data for Houston and NYC to produce a better analysis.
Imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st
from shapely.geometry import Point
from numpy.random import choice
import warnings
warnings.filterwarnings('ignore')