Annual Cost of Living Monte Carlo Models


Cost of Living Projections

Introduction

I do not like negotiating salary, especially without valid projections to determine a range.

I prepared this report to estimate a salary expectation that will maintain my current standard of living.

I present two Monte Carlo models of Houston and NYC annual living costs. The data is somewhat dated and, particularly in the case of Houston, consists of high-level estimates.

To produce a better report, I am currently scraping data from the internet for more accurate sample distributions. I will be able to present that soon.

With that said, the model should not deviate by more than about 5-10 percent from what is presented below.

Findings

An annual salary of $90,000 would be sufficient to qualify for rent in Houston and most likely in the median-income neighborhoods of NYC.

I arrived at this number by computing a confidence interval for annual rent costs in both cities, assuming a normal distribution. I then multiplied that number by 3 to meet the lease qualifications of most landlords.
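
As a quick check of that arithmetic, here is a minimal sketch that applies the 3x rule to the annual rent intervals reported in the Houston and NYC sections of this report. The function name `required_salary` and its `income_multiple` parameter are illustrative names, not part of the original notebook, and the rent figures are the upper ends of those intervals rounded to the nearest ten dollars.

```python
def required_salary(annual_rent, income_multiple=3):
    """Annual income needed to qualify for a lease under an income-to-rent rule."""
    return annual_rent * income_multiple


# Assumption: upper ends of the reported annual rent intervals, rounded
houston_annual_rent = 30010
nyc_annual_rent = 29270

houston_salary = required_salary(houston_annual_rent)
nyc_salary = required_salary(nyc_annual_rent)

print(houston_salary)  # 90030
print(nyc_salary)      # 87810
```

Both figures land at or just under the $90,000 stated above.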

Limitations of the Model

Old NYC Data

The data I am using was sourced from 2018. I will be updating it soon.

Houston Data

The Houston estimate is based on staying in the property I currently rent. The rent is 2400 a month. I estimated that it could rise to at most about 2600 in the next year. If I were to move, similar housing goes for around 2200 to about 2600 a month. I used these as the bounds of my estimates.

Houston Cost of Living Expenses

I intend to stay in Houston for the next year. I would like to move to NY eventually to be nearer to a central office, but not in the near future.

lower_bound = int(2400)
upper_bound = int(2600)

median = 2500
standard_dev = 100

cap_range = range(lower_bound, upper_bound)

rent_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

rent_sample = choice(rent_distribution,12)
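
Note that `lower_bound` and `upper_bound` are defined above but never applied: `np.random.normal` can return rents outside the 2400 to 2600 range. If the bounds are meant to be hard limits, one option is a truncated normal distribution. The sketch below is a suggestion rather than part of the original model; the name `bounded_rent_distribution` and the fixed `random_state` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import truncnorm

lower_bound, upper_bound = 2400, 2600
mean, standard_dev = 2500, 100

# truncnorm expects its limits in units of standard deviations from the mean
a = (lower_bound - mean) / standard_dev
b = (upper_bound - mean) / standard_dev

# Assumption: random_state is fixed only so the sketch is repeatable
bounded_rent_distribution = truncnorm.rvs(
    a, b, loc=mean, scale=standard_dev, size=10000, random_state=0
)

print(bounded_rent_distribution.min() >= lower_bound)  # True
print(bounded_rent_distribution.max() <= upper_bound)  # True
```

Truncating at one standard deviation on each side narrows the spread considerably, so the resulting annual intervals would be tighter than the ones reported in this report.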

Houston Monthly Food Costs

lower_bound = int(300)
upper_bound = int(500)

median = 400
standard_dev = 50 

food_range = range(lower_bound, upper_bound)

food_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

food_sample = choice(food_distribution, 12)

Houston Insurance Costs

lower_bound = int(200)
upper_bound = int(300)

median = 250
standard_dev = 25

insurance_range = range(lower_bound, upper_bound)

insurance_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

The Houston Cost of Living DF

cost_of_living_df = pd.DataFrame()
cost_of_living_df['rent']= choice(rent_distribution,12)
cost_of_living_df['food'] = choice(food_distribution, 12)
cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
cost_of_living_df

Houston Costs Per Annum Algorithm

The algorithm below calculates the annual cost of rent, food, and insurance to determine total cost per year. Rent, food, and insurance are set by random choice from the distributions defined in the sections above.

I run the simulation 10,000 times, which corresponds to 10,000 random samples of annual costs. The point of doing this is to build an approximately normal distribution of annual totals from which I can compute confidence intervals for my total annual costs.


years = 10000
# One slot per simulated year for each annual figure
cycle_price_samples = np.zeros(shape=years)
cycle_rent_samples = np.zeros(shape=years)
cycle_food_samples = np.zeros(shape=years)
cycle_insurance_samples = np.zeros(shape=years)


for year in range(years):
    # Build a fresh DataFrame of 12 monthly draws for this simulated year
    cost_of_living_df = pd.DataFrame()
    # random choice of monthly rent
    cost_of_living_df['rent'] = choice(rent_distribution, 12)
    # random choice of monthly food costs
    cost_of_living_df['food'] = choice(food_distribution, 12)
    # random choice of monthly insurance costs
    cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
    # total cost for each month
    cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
    # sum the 12 months and record the annual figures for this year
    cycle_price_samples[year] = cost_of_living_df['monthly_cost'].sum()
    cycle_rent_samples[year] = cost_of_living_df.rent.sum()
    cycle_food_samples[year] = cost_of_living_df.food.sum()
    cycle_insurance_samples[year] = cost_of_living_df.insurance.sum()
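
As a sanity check on the simulation, the expected annual total and its spread can be worked out directly from the Houston parameters, since each year is a sum of 12 independent monthly draws per category. This is a minimal sketch under that independence assumption. The dictionary names are illustrative, and the simulated values will differ slightly because they are drawn from a finite pool of 10,000 random numbers.

```python
import math

# Monthly parameters used for the Houston distributions
monthly_means = {'rent': 2500, 'food': 400, 'insurance': 250}
monthly_stds = {'rent': 100, 'food': 50, 'insurance': 25}
months = 12

# Means add across categories and across months
expected_annual_total = months * sum(monthly_means.values())

# Variances add for independent draws, so the spread grows with sqrt(12)
annual_total_std = math.sqrt(months * sum(s ** 2 for s in monthly_stds.values()))

print(expected_annual_total)       # 37800
print(round(annual_total_std, 1))  # 396.9
```

The expected total of 37,800 sits inside the interval reported for Houston total costs in this report.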

Houston Prediction Df

prediction_df = pd.DataFrame()
prediction_df['rent'] = cycle_rent_samples
prediction_df['food'] = cycle_food_samples
prediction_df['insurance'] = cycle_insurance_samples
prediction_df['total'] = cycle_price_samples
prediction_df.describe()

Houston Annual Cost Histogram

prediction_df.total.plot.hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.xlabel('Annual Total Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)

Figure: histogram of simulated annual total costs for Houston.

Houston: Calculating the Confidence Interval For Total Costs

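The simulated totals are approximately normal; larger sample sizes would bring the histogram even closer to a normal curve. Note that the scale passed below is the standard error of the mean, so the result is a confidence interval for the mean annual cost, which is why it is so narrow.


# The confidence level is passed positionally: the keyword is alpha in older SciPy releases and confidence in newer ones
st.norm.interval(0.90, loc=np.mean(prediction_df.total), scale=st.sem(prediction_df.total))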
(37795.2942543157, 37808.287836034055)
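
The interval above is built from the standard error, so it describes how precisely the mean is estimated, not how much a single year might vary. The sketch below contrasts it with the range covering the central 90 percent of simulated years. It uses stand-in data rather than the notebook's `prediction_df`: the array `simulated_totals`, its mean of 37,800, and its spread of 400 are placeholder assumptions chosen to resemble the Houston totals.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Assumption: stand-in for prediction_df.total, 10,000 simulated annual totals
simulated_totals = rng.normal(loc=37800, scale=400, size=10000)

mean_total = simulated_totals.mean()

# Interval for the mean: shrinks as the number of simulated years grows
ci_of_mean = st.norm.interval(0.90, loc=mean_total, scale=st.sem(simulated_totals))

# Range covering the central 90% of simulated years: 5th to 95th percentile
central_90 = np.percentile(simulated_totals, [5, 95])

print(ci_of_mean)
print(central_90)
```

For budgeting a single year, the percentile range is probably the more relevant figure, since it reflects year-to-year variation rather than uncertainty in the average.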

Houston Annual Rent Histogram

# Annual rent histogram
prediction_df.rent.plot.hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.title('Annual Rent Cost Distribution ')
plt.xlabel('Annual Rent Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)

Figure: histogram of simulated annual rent costs for Houston.

Houston: Calculating the Confidence Interval For Annual Rent

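The simulated annual rent is approximately normal; larger sample sizes would bring the histogram even closer to a normal curve.


# The confidence level is passed positionally: the keyword is alpha in older SciPy releases and confidence in newer ones
st.norm.interval(0.95, loc=np.mean(prediction_df.rent), scale=st.sem(prediction_df.rent))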
(29996.264715447538, 30009.767827637417)

New York Cost of Living Expenses

For the sake of comparison, the New York expense distributions are calculated below. I assume that everything but rent will be equivalent to Houston. A more accurate model would account for differences in insurance, food, and incidentals.

I am assuming the rent of a two-bedroom apartment.

The data I am using was scraped from Craigslist in 2018. I will redo it later with 2022 data to get a better model.

nyc_df = pd.read_csv("/Users/jnapolitano/Projects/cost-of-living-projections/nyc-housing.csv", encoding="unicode-escape")
# assuming a two-bedroom apartment
nyc_df = nyc_df[nyc_df['Bedrooms']== '2br']
nyc_df.describe()

The price is about 2,800 with a standard deviation of 7,465, which is absurd. To do a better analysis, I need to clean the data.


idx = (nyc_df.Price > 500) & (nyc_df.Price < 4500)
nyc_df = nyc_df[idx]
nyc_df.describe()

After removing outliers, the data is far more manageable. I'm surprised by the mean price. Again, this data is old, and it also does not account for neighborhoods. I will redo the analysis at a later date, filtered by neighborhood.
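
The 500 and 4500 cutoffs above are hand-picked. A common alternative is to derive the cutoffs from the data with the interquartile range (IQR) rule. The sketch below shows that rule on a small, made-up list of prices. The `prices` series is hypothetical and stands in for `nyc_df.Price`, and the 1.5 multiplier is the conventional default rather than something taken from this analysis.

```python
import pandas as pd

# Hypothetical listing prices standing in for nyc_df.Price
prices = pd.Series([1800, 2100, 2300, 2450, 2600, 2900, 3200, 95000, 1])

# Quartiles and interquartile range
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Keep prices within 1.5 * IQR of the quartiles
keep = prices.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = prices[keep]

print(len(cleaned))  # 7
```

On this toy data the rule drops the 1 and 95,000 listings and keeps the rest. On the real listings the resulting cutoffs would depend on the actual quartiles.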

Creating the NYC Distributions

lower_bound = int(600)
upper_bound = int(4500)

median = 2435
standard_dev = 729 

cap_range = range(lower_bound, upper_bound)

rent_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

rent_sample = choice(rent_distribution,12)

NYC Monthly Food Costs

lower_bound = int(300)
upper_bound = int(500)

median = 400
standard_dev = 50 

food_range = range(lower_bound, upper_bound)

food_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

food_sample = choice(food_distribution, 12)

NYC Insurance Costs

lower_bound = int(200)
upper_bound = int(300)

median = 250
standard_dev = 25

insurance_range = range(lower_bound, upper_bound)

insurance_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

NYC Cost of Living Distribution

cost_of_living_df = pd.DataFrame()
cost_of_living_df['rent']= choice(rent_distribution,12)
cost_of_living_df['food'] = choice(food_distribution, 12)
cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
cost_of_living_df

NYC Costs Per Annum Algorithm

The algorithm below calculates the annual cost of rent, food, and insurance to determine total cost per year. Rent, food, and insurance are set by random choice from the distributions defined in the sections above.

I run the simulation 10,000 times, which corresponds to 10,000 random samples of annual costs. The point of doing this is to build an approximately normal distribution of annual totals from which I can compute confidence intervals for my total annual costs.


years = 10000
# One slot per simulated year for each annual figure
cycle_price_samples = np.zeros(shape=years)
cycle_rent_samples = np.zeros(shape=years)
cycle_food_samples = np.zeros(shape=years)
cycle_insurance_samples = np.zeros(shape=years)


for year in range(years):
    # Build a fresh DataFrame of 12 monthly draws for this simulated year
    cost_of_living_df = pd.DataFrame()
    # random choice of monthly rent
    cost_of_living_df['rent'] = choice(rent_distribution, 12)
    # random choice of monthly food costs
    cost_of_living_df['food'] = choice(food_distribution, 12)
    # random choice of monthly insurance costs
    cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
    # total cost for each month
    cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
    # sum the 12 months and record the annual figures for this year
    cycle_price_samples[year] = cost_of_living_df['monthly_cost'].sum()
    cycle_rent_samples[year] = cost_of_living_df.rent.sum()
    cycle_food_samples[year] = cost_of_living_df.food.sum()
    cycle_insurance_samples[year] = cost_of_living_df.insurance.sum()
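
The Houston and NYC loops are identical apart from the rent distribution, so the simulation could be wrapped in one reusable function. The sketch below is one possible refactor, not the notebook's own code: the function name `simulate_annual_costs` and the variable `nyc_prediction_df` are hypothetical, and it replaces the per-year DataFrame with a single array of shape (years, 12) that is summed across months.

```python
import numpy as np
import pandas as pd
from numpy.random import choice


def simulate_annual_costs(rent_distribution, food_distribution,
                          insurance_distribution, years=10000):
    """Return one row per simulated year: annual rent, food, insurance, total."""
    # Draw 12 monthly values per year in one call, then sum across the months
    rent = choice(rent_distribution, size=(years, 12)).sum(axis=1)
    food = choice(food_distribution, size=(years, 12)).sum(axis=1)
    insurance = choice(insurance_distribution, size=(years, 12)).sum(axis=1)
    return pd.DataFrame({
        'rent': rent,
        'food': food,
        'insurance': insurance,
        'total': rent + food + insurance,
    })


# NYC parameters from the sections above
nyc_rent = np.random.normal(loc=2435, scale=729, size=10000)
food = np.random.normal(loc=400, scale=50, size=10000)
insurance = np.random.normal(loc=250, scale=25, size=10000)

nyc_prediction_df = simulate_annual_costs(nyc_rent, food, insurance)
print(nyc_prediction_df.describe())
```

Calling the same function with the Houston rent distribution would reproduce the Houston model without duplicating the loop.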

NYC Prediction Df

prediction_df = pd.DataFrame()
prediction_df['rent'] = cycle_rent_samples
prediction_df['food'] = cycle_food_samples
prediction_df['insurance'] = cycle_insurance_samples
prediction_df['total'] = cycle_price_samples
prediction_df.describe()

NYC Annual Cost Histogram

prediction_df.total.plot.hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.xlabel('Annual Total Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)

Figure: histogram of simulated annual total costs for NYC.

NYC: Calculating the Confidence Interval For Total Costs

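The simulated totals are approximately normal; larger sample sizes would bring the histogram even closer to a normal curve.


# The confidence level is passed positionally: the keyword is alpha in older SciPy releases and confidence in newer ones
st.norm.interval(0.90, loc=np.mean(prediction_df.total), scale=st.sem(prediction_df.total))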
(36979.727235126586, 37063.36039733022)

NYC Annual Rent Histogram

# Annual rent histogram
prediction_df.rent.plot.hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.title('Annual Rent Cost Distribution ')
plt.xlabel('Annual Rent Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)

Figure: histogram of simulated annual rent costs for NYC.

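NYC: Calculating the Confidence Interval For Annual Rent

The simulated annual rent is approximately normal; larger sample sizes would bring the histogram even closer to a normal curve.


# The confidence level is passed positionally: the keyword is alpha in older SciPy releases and confidence in newer ones
st.norm.interval(0.95, loc=np.mean(prediction_df.rent), scale=st.sem(prediction_df.rent))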
(29169.877514702926, 29269.14186706609)

NYC Closing Remarks

The rent distribution in NYC with 2018 data is actually nearly comparable to my Houston estimate. An annual salary of 90,000 would permit me to live at about the median level in the city. I will be redoing this report soon, as the data is old. I am currently scraping data in Houston and NYC to produce a better analysis.

Imports

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st
from numpy.random import choice
import warnings

warnings.filterwarnings('ignore')
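
The simulations in this report draw from NumPy's global random state without a seed, so the reported figures will shift slightly from run to run. A minimal sketch of making the draws repeatable with a seeded generator follows. The seed value 42 and the `_again` variable names are arbitrary choices for illustration, not part of the original notebook.

```python
import numpy as np

# Assumption: the seed value is arbitrary; any fixed integer works
rng = np.random.default_rng(seed=42)
rent_distribution = rng.normal(loc=2500, scale=100, size=10000)
rent_sample = rng.choice(rent_distribution, 12)

# Re-creating the generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(seed=42)
rent_distribution_again = rng_again.normal(loc=2500, scale=100, size=10000)

print(np.array_equal(rent_distribution, rent_distribution_again))  # True
```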