Airbnb Zillow House Listings¶

1. Setting up the Notebook for Project

1.1 Set Parameters

1.2 Import Libraries

1.3 Set Working Directory

2. Get File Names & Read Raw Data¶

3. Preprocess & Analyze Listings Data

3.1 Function to Preprocess Listings Data

3.2 Function to EDA Listings Data

4. Preprocess & Analyze Home Value Data

4.1 Function to Preprocess Home Value Data

4.2 Function to EDA Home Value Data

5. Profit Analysis for Different Time Horizons

5.1 Function to analyze profit estimations for various time horizons

6. Conclusion

6.1 Insights

6.2 Data Quality Checks & Derived Fields

6.3 Methodology

6.4 Next Steps

1. Setting up the Notebook for Project¶

1.1 Set Parameters¶

In [1]:
#provide project path which contains the raw data
project_path = "C:\\Users\\gusahil\\Downloads\\One"

#provide city in lower case
project_city = "new york"

#provide key columns for listings data relevant to analysis
listings_columns = ['id','last_scraped','host_id','neighbourhood','neighbourhood_cleansed','neighbourhood_group_cleansed', \
                    'market','city','zipcode','state','property_type','room_type','accommodates','availability_30','availability_60', \
                    'availability_90','availability_365','bathrooms','bedrooms','bed_type','beds','square_feet','price', \
                    'weekly_price','monthly_price','security_deposit']

1.2 Import Libraries¶

In [2]:
import pandas as pd
import numpy as np
from time import gmtime, strftime #DateTime Formatting
import matplotlib as mpl
import matplotlib.pyplot as plt #Basic Plotting
import seaborn as sns #Advanced Plotting
import plotly.express as px #Advanced Plotting
sns.set_theme(style="whitegrid")
import datetime
import warnings
import sklearn
import xgboost


#from yellowbrick.regressor import PredictionError
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.display.float_format = '{:.3f}'.format
import pandas_profiling

1.3 Set Working Directory¶

In [3]:
import os
orig_dir = os.getcwd()
os.chdir(project_path)
os.getcwd()
Out[3]:
'C:\\Users\\gusahil\\Downloads\\One'
In [4]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

2. Get File Names & Read Raw Data¶

In [5]:
def get_filenames(path = project_path):
    listings = []
    for file in os.listdir(path):
        if file.startswith("listings file"):
            listings.append(file)
            

    home_value = []
    for file in os.listdir(path):
        if file.startswith("Zip_Zhvi"):
            home_value.append(file)
            
    return(np.sort(listings), np.sort(home_value))
            
listings, home_value = get_filenames()
In [6]:
def read_filenames(path = project_path):
    listings_df = pd.read_csv(listings[0], sep=",")
    print("read:", listings[0])
    for file in listings[1:]:
        print("read:", file)
        file_df = pd.read_csv(file, sep=",")
        listings_df.append(file_df)
        
    home_value_df = pd.read_csv(home_value[0], sep=",")
    
    print("read:", home_value[0])
    for file in home_value[1:]:
        print("read:", file)
        file_df = pd.read_csv(file, sep=",")
        home_value_df.append(file_df)
            
    return(listings_df, home_value_df)
            
listings_df, home_value_df = read_filenames()
read: listings file 1 of 4.csv
read: listings file 2 of 4.csv
read: listings file 3 of 4.csv
read: listings file 4 of 4.csv
read: Zip_Zhvi_2bedroom.csv

3. Preprocess & Analyze Listings Data¶

3.1 Function to Preprocess Listings Data¶

In [7]:
listings_df.head()
Out[7]:
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview notes transit access interaction house_rules thumbnail_url medium_url picture_url xl_picture_url host_id host_url host_name host_since host_location host_about host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified street neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed city state zipcode market smart_location country_code country latitude longitude is_location_exact property_type room_type accommodates bathrooms bedrooms beds bed_type amenities square_feet price weekly_price monthly_price security_deposit cleaning_fee guests_included extra_people minimum_nights maximum_nights calendar_updated has_availability availability_30 availability_60 availability_90 availability_365 calendar_last_scraped number_of_reviews first_review last_review review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value requires_license license jurisdiction_names instant_bookable cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count reviews_per_month
0 7949480 https://www.airbnb.com/rooms/7949480 20200000000000.000 5/3/17 City Island Sanctuary relaxing BR & Bath w Par... Come relax on City Island in our quiet guest r... On parle français et anglais, (lire Français... Come relax on City Island in our quiet guest r... none City Island is a unique sanctuary in New York ... We are walkable to Pelham Park which is the la... On the island, you can walk to everything. A ... Feel free to enjoy our large backyard in seaso... We speak fluent French and English. We will b... No extra guests or visitors who are not in the... https://a0.muscache.com/im/pictures/001d1cb0-6... https://a0.muscache.com/im/pictures/001d1cb0-6... https://a0.muscache.com/im/pictures/001d1cb0-6... https://a0.muscache.com/im/pictures/001d1cb0-6... 119445 https://www.airbnb.com/users/show/119445 Linda & Didier 5/6/10 New York, New York, United States I am a designer, former instructor at a design... within an hour 100% NaN t https://a0.muscache.com/im/pictures/2de30fe4-b... https://a0.muscache.com/im/pictures/2de30fe4-b... City Island 1.000 1.000 ['email', 'phone', 'reviews', 'jumio'] t t City Island, Bronx, NY 10464, United States City Island City Island Bronx Bronx NY 10464 New York Bronx, NY US United States 40.852 -73.789 t House Private room 2 1.000 1.000 1.000 Real Bed {"Cable TV","Wireless Internet","Air condition... NaN $99.00 NaN NaN $100.00 NaN 1 $20.00 1 7 yesterday NaN 24 54 80 170 5/3/17 25 1/18/16 4/23/17 100.000 10.000 10.000 10.000 10.000 10.000 10.000 f NaN NaN f moderate t t 1 1.590
1 16042478 https://www.airbnb.com/rooms/16042478 20200000000000.000 5/4/17 WATERFRONT STUDIO APARTMENT My place is close to Sea Shore. You’ll love ... (URL HIDDEN) My place is close to Sea Shore. You’ll love ... none (URL HIDDEN) Fine sea food restaurant, bars and night clubs... NaN public transport from Grand Central, NYC or fr... NaN NaN https://a0.muscache.com/im/pictures/bb5bc3c4-3... https://a0.muscache.com/im/pictures/bb5bc3c4-3... https://a0.muscache.com/im/pictures/bb5bc3c4-3... https://a0.muscache.com/im/pictures/bb5bc3c4-3... 9117975 https://www.airbnb.com/users/show/9117975 Collins 9/29/13 New York, New York, United States I am married with 3 children and 2 grandkid. ... a few days or more 0% NaN f https://a0.muscache.com/im/users/9117975/profi... https://a0.muscache.com/im/users/9117975/profi... City Island 1.000 1.000 ['phone', 'facebook'] t f City Island, Bronx, NY 10464, United States City Island City Island Bronx Bronx NY 10464 New York Bronx, NY US United States 40.853 -73.789 t Apartment Private room 4 1.000 1.000 1.000 Real Bed {TV,Internet,"Wireless Internet","Air conditio... NaN $200.00 NaN NaN NaN NaN 1 $0.00 7 29 6 months ago NaN 30 60 90 180 5/4/17 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN f NaN NaN t flexible f f 1 NaN
2 1886820 https://www.airbnb.com/rooms/1886820 20200000000000.000 5/4/17 Quaint City Island Community. Quiet island boating town on Long Island Soun... Master bed with queen bed, full bath and offi... Quiet island boating town on Long Island Soun... none Small New England type town in the middle of ... NaN Parking available in driveway. Bus stop aroun... 2 story 1800 sq ft house all to yourselves. Local caretaker available during stay. We will... No smoking. No pets. This is our private hom... NaN NaN https://a0.muscache.com/im/pictures/26266928/1... NaN 9815788 https://www.airbnb.com/users/show/9815788 Steve 11/4/13 US NaN NaN NaN NaN f https://a0.muscache.com/im/users/9815788/profi... https://a0.muscache.com/im/users/9815788/profi... City Island 1.000 1.000 ['email', 'phone'] t f City Island, Bronx, NY 10464, United States City Island City Island Bronx Bronx NY 10464 New York Bronx, NY US United States 40.841 -73.783 t House Entire home/apt 4 3.000 3.000 3.000 Real Bed {TV,"Cable TV",Internet,"Wireless Internet","A... NaN $300.00 NaN NaN $800.00 $100.00 4 $25.00 7 90 11 months ago NaN 30 60 90 365 5/4/17 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN f NaN NaN f strict f f 1 NaN
3 6627449 https://www.airbnb.com/rooms/6627449 20200000000000.000 5/5/17 Large 1 BDRM in Great location This ground floor apartment is light and airy ... We are close to fishing, boating, biking, hors... This ground floor apartment is light and airy ... none City Island is a unique and a hidden gem of Ne... There is off street parking in the apartment c... We are (2) blocks from the BX29 bus which you ... The entire apartment is available as I am only... My daughter and her family live on the Island ... There is no smoking and no pets. Otherwise us... https://a0.muscache.com/im/pictures/83539171/1... https://a0.muscache.com/im/pictures/83539171/1... https://a0.muscache.com/im/pictures/83539171/1... https://a0.muscache.com/im/pictures/83539171/1... 13886510 https://www.airbnb.com/users/show/13886510 Arlene 4/4/14 Los Angeles, California, United States NaN within a few hours 100% NaN f https://a0.muscache.com/im/users/13886510/prof... https://a0.muscache.com/im/users/13886510/prof... City Island 1.000 1.000 ['email', 'phone', 'reviews', 'kba'] t t City Island, City Island, NY 10464, United States City Island City Island Bronx City Island NY 10464 New York City Island, NY US United States 40.850 -73.787 t Apartment Entire home/apt 3 1.000 1.000 1.000 Real Bed {TV,Internet,"Wireless Internet","Air conditio... NaN $125.00 $775.00 NaN NaN $75.00 1 $0.00 3 21 2 weeks ago NaN 8 30 60 335 5/5/17 12 7/4/15 10/24/16 93.000 10.000 10.000 10.000 10.000 10.000 10.000 f NaN NaN f strict f f 1 0.540
4 5557381 https://www.airbnb.com/rooms/5557381 20200000000000.000 5/4/17 Quaint City Island Home Located in an old sea-shanty town, our home ha... You won't find a place so close to the city (N... Located in an old sea-shanty town, our home ha... none City Island is unique in two ways. First, you ... I am going to put this in a few places (also a... The easiest way to get to Manhattan quickly is... The house is just a place to sleep (and cook a... I live in the back house (there are two houses... Only three actual rules: 1. Don't smoke in the... https://a0.muscache.com/im/pictures/85805287/7... https://a0.muscache.com/im/pictures/85805287/7... https://a0.muscache.com/im/pictures/85805287/7... https://a0.muscache.com/im/pictures/85805287/7... 28811542 https://www.airbnb.com/users/show/28811542 Phoebe 3/5/15 New York, New York, United States My name is Phoebe. Both my parents are biologi... within an hour 100% NaN f https://a0.muscache.com/im/pictures/43cb9adc-f... https://a0.muscache.com/im/pictures/43cb9adc-f... City Island 1.000 1.000 ['email', 'phone', 'reviews', 'jumio'] t t City Island, Bronx, NY 10464, United States City Island City Island Bronx Bronx NY 10464 New York Bronx, NY US United States 40.850 -73.789 t House Private room 4 1.000 1.000 1.000 Real Bed {Internet,"Wireless Internet","Air conditionin... NaN $69.00 $350.00 $1,200.00 NaN $17.00 2 $15.00 3 1125 yesterday NaN 17 47 77 352 5/4/17 86 5/25/15 4/22/17 97.000 10.000 10.000 10.000 10.000 10.000 10.000 f NaN NaN t moderate f f 1 3.630
In [8]:
#listings_df['city'][listings_df['city'].astype(str).str.contains('new york', case=False, regex=True)]
In [9]:
listings_df.describe(include = 'all').T
Out[9]:
count unique top freq mean std min 25% 50% 75% max
id 9999.000 NaN NaN NaN 9968478.863 5585354.700 3831.000 5066713.000 10062378.000 15058841.500 18516880.000
listing_url 9999 9999 https://www.airbnb.com/rooms/16301294 1 NaN NaN NaN NaN NaN NaN NaN
scrape_id 9999.000 NaN NaN NaN 20200000000000.000 0.000 20200000000000.000 20200000000000.000 20200000000000.000 20200000000000.000 20200000000000.000
last_scraped 9999 4 5/4/17 4228 NaN NaN NaN NaN NaN NaN NaN
name 9990 9859 DRIFTER INN Rockaway Beach 12 NaN NaN NaN NaN NaN NaN NaN
summary 9605 9272 Enjoy Rockaway Beach at DRIFTER, and intention... 17 NaN NaN NaN NaN NaN NaN NaN
space 6929 6726 You'll feel at home at the DRIFTER INN, a quin... 11 NaN NaN NaN NaN NaN NaN NaN
description 9996 9852 Enjoy Rockaway Beach at DRIFTER, and intention... 10 NaN NaN NaN NaN NaN NaN NaN
experiences_offered 9999 1 none 9999 NaN NaN NaN NaN NaN NaN NaN
neighborhood_overview 5948 5410 Bedford-Stuyvesant, popularly known as Bed-Stu... 22 NaN NaN NaN NaN NaN NaN NaN
notes 3644 3349 #NAME? 14 NaN NaN NaN NaN NaN NaN NaN
transit 6476 5954 We are just around the corner from the G train... 12 NaN NaN NaN NaN NaN NaN NaN
access 6021 5450 #NAME? 44 NaN NaN NaN NaN NaN NaN NaN
interaction 5755 5281 You will be sharing the space with us and othe... 12 NaN NaN NaN NaN NaN NaN NaN
house_rules 6551 5575 #NAME? 537 NaN NaN NaN NaN NaN NaN NaN
thumbnail_url 7901 7897 https://a0.muscache.com/im/pictures/53c4d6d0-d... 3 NaN NaN NaN NaN NaN NaN NaN
medium_url 7901 7897 https://a0.muscache.com/im/pictures/53c4d6d0-d... 3 NaN NaN NaN NaN NaN NaN NaN
picture_url 9999 9994 https://a0.muscache.com/im/pictures/4491e5c5-3... 3 NaN NaN NaN NaN NaN NaN NaN
xl_picture_url 7901 7897 https://a0.muscache.com/im/pictures/53c4d6d0-d... 3 NaN NaN NaN NaN NaN NaN NaN
host_id 9999.000 NaN NaN NaN 29983005.739 32441138.463 2571.000 4620335.000 17793315.000 44882137.500 128441579.000
host_url 9999 8285 https://www.airbnb.com/users/show/8874674 18 NaN NaN NaN NaN NaN NaN NaN
host_name 9928 3780 Michael 102 NaN NaN NaN NaN NaN NaN NaN
host_since 9928 2388 9/16/13 27 NaN NaN NaN NaN NaN NaN NaN
host_location 9892 405 New York, New York, United States 8051 NaN NaN NaN NaN NaN NaN NaN
host_about 6441 5080 I'm a British/Canadian/American tri-citizen ha... 18 NaN NaN NaN NaN NaN NaN NaN
host_response_time 6840 4 within an hour 3452 NaN NaN NaN NaN NaN NaN NaN
host_response_rate 6840 72 100% 4845 NaN NaN NaN NaN NaN NaN NaN
host_acceptance_rate 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
host_is_superhost 9928 2 f 8891 NaN NaN NaN NaN NaN NaN NaN
host_thumbnail_url 9928 8220 https://a0.muscache.com/defaults/user_pic-50x5... 27 NaN NaN NaN NaN NaN NaN NaN
host_picture_url 9928 8220 https://a0.muscache.com/defaults/user_pic-225x... 27 NaN NaN NaN NaN NaN NaN NaN
host_neighbourhood 8786 146 Bedford-Stuyvesant 2288 NaN NaN NaN NaN NaN NaN NaN
host_listings_count 9928.000 NaN NaN NaN 1.942 2.671 0.000 1.000 1.000 2.000 47.000
host_total_listings_count 9928.000 NaN NaN NaN 1.942 2.671 0.000 1.000 1.000 2.000 47.000
host_verifications 9999 279 ['email', 'phone', 'reviews', 'kba'] 1694 NaN NaN NaN NaN NaN NaN NaN
host_has_profile_pic 9928 2 t 9901 NaN NaN NaN NaN NaN NaN NaN
host_identity_verified 9928 2 t 6708 NaN NaN NaN NaN NaN NaN NaN
street 9999 528 Bushwick, Brooklyn, NY 11237, United States 685 NaN NaN NaN NaN NaN NaN NaN
neighbourhood 8590 79 Bedford-Stuyvesant 2282 NaN NaN NaN NaN NaN NaN NaN
neighbourhood_cleansed 9999 52 Bedford-Stuyvesant 2850 NaN NaN NaN NaN NaN NaN NaN
neighbourhood_group_cleansed 9999 5 Brooklyn 6994 NaN NaN NaN NaN NaN NaN NaN
city 9989 115 Brooklyn 6582 NaN NaN NaN NaN NaN NaN NaN
state 9999 3 NY 9997 NaN NaN NaN NaN NaN NaN NaN
zipcode 9821 132 11221 1368 NaN NaN NaN NaN NaN NaN NaN
market 9950 9 New York 9934 NaN NaN NaN NaN NaN NaN NaN
smart_location 9999 117 Brooklyn, NY 6581 NaN NaN NaN NaN NaN NaN NaN
country_code 9999 2 US 9998 NaN NaN NaN NaN NaN NaN NaN
country 9999 2 United States 9998 NaN NaN NaN NaN NaN NaN NaN
latitude 9999.000 NaN NaN NaN 40.703 0.040 40.549 40.684 40.693 40.716 40.876
longitude 9999.000 NaN NaN NaN -73.947 0.041 -74.172 -73.968 -73.943 -73.922 -73.713
is_location_exact 9999 2 t 8572 NaN NaN NaN NaN NaN NaN NaN
property_type 9999 17 Apartment 8153 NaN NaN NaN NaN NaN NaN NaN
room_type 9999 3 Private room 5308 NaN NaN NaN NaN NaN NaN NaN
accommodates 9999.000 NaN NaN NaN 2.821 1.951 1.000 2.000 2.000 4.000 16.000
bathrooms 9947.000 NaN NaN NaN 1.162 0.425 0.000 1.000 1.000 1.000 6.000
bedrooms 9988.000 NaN NaN NaN 1.181 0.704 0.000 1.000 1.000 1.000 9.000
beds 9975.000 NaN NaN NaN 1.560 1.123 1.000 1.000 1.000 2.000 16.000
bed_type 9999 5 Real Bed 9712 NaN NaN NaN NaN NaN NaN NaN
amenities 9999 9199 {} 67 NaN NaN NaN NaN NaN NaN NaN
square_feet 127.000 NaN NaN NaN 825.071 713.377 0.000 380.000 775.000 1000.000 5000.000
price 9999 358 $50.00 452 NaN NaN NaN NaN NaN NaN NaN
weekly_price 2404 400 $500.00 113 NaN NaN NaN NaN NaN NaN NaN
monthly_price 1955 359 $1,200.00 92 NaN NaN NaN NaN NaN NaN NaN
security_deposit 4189 108 $100.00 922 NaN NaN NaN NaN NaN NaN NaN
cleaning_fee 6811 117 $50.00 898 NaN NaN NaN NaN NaN NaN NaN
guests_included 9999.000 NaN NaN NaN 1.488 1.100 1.000 1.000 1.000 2.000 16.000
extra_people 9999 68 $0.00 5227 NaN NaN NaN NaN NaN NaN NaN
minimum_nights 9999.000 NaN NaN NaN 3.797 12.398 1.000 1.000 2.000 3.000 1000.000
maximum_nights 9999.000 NaN NaN NaN 879.763 15862.601 1.000 30.000 1125.000 1125.000 1231231.000
calendar_updated 9999 62 today 1152 NaN NaN NaN NaN NaN NaN NaN
has_availability 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
availability_30 9999.000 NaN NaN NaN 7.575 10.220 0.000 0.000 2.000 12.000 30.000
availability_60 9999.000 NaN NaN NaN 18.917 21.484 0.000 0.000 9.000 35.000 60.000
availability_90 9999.000 NaN NaN NaN 32.573 33.310 0.000 0.000 22.000 62.000 90.000
availability_365 9999.000 NaN NaN NaN 142.272 139.324 0.000 0.000 90.000 290.000 365.000
calendar_last_scraped 9999 4 5/3/17 5369 NaN NaN NaN NaN NaN NaN NaN
number_of_reviews 9999.000 NaN NaN NaN 16.564 29.587 0.000 1.000 4.000 19.000 335.000
first_review 7698 1607 1/1/17 64 NaN NaN NaN NaN NaN NaN NaN
last_review 7725 783 4/16/17 309 NaN NaN NaN NaN NaN NaN NaN
review_scores_rating 7545.000 NaN NaN NaN 93.046 8.292 20.000 90.000 95.000 100.000 100.000
review_scores_accuracy 7527.000 NaN NaN NaN 9.554 0.813 2.000 9.000 10.000 10.000 10.000
review_scores_cleanliness 7534.000 NaN NaN NaN 9.196 1.108 2.000 9.000 10.000 10.000 10.000
review_scores_checkin 7520.000 NaN NaN NaN 9.714 0.678 2.000 10.000 10.000 10.000 10.000
review_scores_communication 7531.000 NaN NaN NaN 9.740 0.663 2.000 10.000 10.000 10.000 10.000
review_scores_location 7519.000 NaN NaN NaN 9.221 0.922 2.000 9.000 9.000 10.000 10.000
review_scores_value 7516.000 NaN NaN NaN 9.353 0.868 2.000 9.000 10.000 10.000 10.000
requires_license 9999 1 f 9999 NaN NaN NaN NaN NaN NaN NaN
license 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
jurisdiction_names 2 2 Colorado State 1 NaN NaN NaN NaN NaN NaN NaN
instant_bookable 9999 2 f 7910 NaN NaN NaN NaN NaN NaN NaN
cancellation_policy 9999 5 strict 4352 NaN NaN NaN NaN NaN NaN NaN
require_guest_profile_picture 9999 2 f 9637 NaN NaN NaN NaN NaN NaN NaN
require_guest_phone_verification 9999 2 f 9612 NaN NaN NaN NaN NaN NaN NaN
calculated_host_listings_count 9999.000 NaN NaN NaN 1.779 2.157 1.000 1.000 1.000 2.000 35.000
reviews_per_month 7698.000 NaN NaN NaN 1.324 1.492 0.020 0.250 0.780 1.990 26.150
In [10]:
def preprocess_listings(df = listings_df, imputation_columns = ['square_feet','price','weekly_price','monthly_price','security_deposit'], format_columns=['zipcode','host_id','bathrooms','bedrooms']):
    
    #Drop redundant columns
    df = df[listings_columns]
    df = df[~df['zipcode'].isnull()]
    df['bathrooms'] = df['bathrooms'].fillna(0)
    
    
    #Remove "$" & "," from price columns
    price_columns = [i for i in imputation_columns if ('price' in i) or ('deposit' in i)]
    
    for column in price_columns:
        df[column] = np.round(df[column].apply(lambda x: str(x).replace("$","").replace(",","")).astype(float),0)
        
    #Format columns
    for column in format_columns:
        df[column] = df[column].astype(str).apply(lambda x: x.split(".")[0].split("-")[0])
        
    #Impute missing values
    for column in imputation_columns:
        df_available = df[~df[column].isnull()]
        df_groupby = df_available.groupby(['bathrooms','bedrooms','beds','property_type'])[column].median().reset_index().rename(columns={column: column+'_impute'})
    
        df = pd.merge(df, df_groupby[['bathrooms','bedrooms','beds','property_type',column+'_impute']], on = ['bathrooms','bedrooms','beds','property_type'], how='left')
        df.loc[df[column].isnull(), column] = df.loc[df[column].isnull(), column+'_impute']
        df.drop([column+'_impute'], axis=1, inplace=True)
    
    #Create occupancy columns
    df['occupancy_30'] = np.round((30 - df['availability_30'])*100 / 30,0)
    df['occupancy_60'] = np.round((60 - df['availability_60'])*100 / 60,0)
    df['occupancy_90'] = np.round((90 - df['availability_90'])*100 / 90,0)
    df['occupancy_365'] = np.round((365 - df['availability_365'])*100 / 365,0)
    df['estimate_occupancy'] = df[['occupancy_30','occupancy_60','occupancy_90','occupancy_365']].max(axis=1)

    return(df[(df['city'].astype(str).str.lower() == project_city) & (df['bedrooms'] == '2')])
    #return(df[(listings_df['city'].astype(str).str.contains(project_city, case=False, regex=True)) & (df['bedrooms'] == '2')])
    
preprocess_listings_df = preprocess_listings()
preprocess_listings_df.head()
Out[10]:
id last_scraped host_id neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed market city zipcode state property_type room_type accommodates availability_30 availability_60 availability_90 availability_365 bathrooms bedrooms bed_type beds square_feet price weekly_price monthly_price security_deposit occupancy_30 occupancy_60 occupancy_90 occupancy_365 estimate_occupancy
33 4357134 5/5/17 7809661 Astoria Ditmars Steinway Queens New York New York 11102 NY Apartment Entire home/apt 4 0 0 0 0 1 2 Real Bed 2.000 950.000 200.000 999.500 3000.000 250.000 100.000 100.000 100.000 100.000 100.000
486 13612947 5/5/17 76332312 NaN Astoria Queens New York New York 11106 NY Apartment Entire home/apt 4 29 59 89 364 1 2 Real Bed 3.000 900.000 150.000 1000.000 3375.000 100.000 3.000 2.000 1.000 0.000 3.000
781 18430507 5/3/17 127732143 Columbia Street Waterfront Astoria Queens New York New York 11102 NY Apartment Entire home/apt 4 29 59 89 89 2 2 Real Bed 2.000 1300.000 290.000 1300.000 4647.500 500.000 3.000 2.000 1.000 76.000 76.000
1008 294239 5/5/17 1521432 Astoria Astoria Queens New York New York 11106 NY Apartment Entire home/apt 6 0 5 20 20 1 2 Real Bed 2.000 950.000 200.000 1050.000 3000.000 250.000 100.000 92.000 78.000 95.000 100.000
1247 9944426 5/4/17 51105827 Battery Park City Battery Park City Manhattan New York New York 10006 NY Apartment Entire home/apt 5 27 57 87 362 1 2 Real Bed 2.000 950.000 225.000 999.500 3000.000 250.000 10.000 5.000 3.000 1.000 10.000
In [11]:
preprocess_listings_df.describe(include = 'all').T
Out[11]:
count unique top freq mean std min 25% 50% 75% max
id 199.000 NaN NaN NaN 9405259.678 5932849.399 61224.000 4412452.500 9804368.000 15235137.000 18499691.000
last_scraped 199 4 5/3/17 82 NaN NaN NaN NaN NaN NaN NaN
host_id 199 199 6519939 1 NaN NaN NaN NaN NaN NaN NaN
neighbourhood 166 13 Chelsea 93 NaN NaN NaN NaN NaN NaN NaN
neighbourhood_cleansed 199 8 Chelsea 131 NaN NaN NaN NaN NaN NaN NaN
neighbourhood_group_cleansed 199 3 Manhattan 192 NaN NaN NaN NaN NaN NaN NaN
market 199 1 New York 199 NaN NaN NaN NaN NaN NaN NaN
city 199 1 New York 199 NaN NaN NaN NaN NaN NaN NaN
zipcode 199 16 10011 81 NaN NaN NaN NaN NaN NaN NaN
state 199 1 NY 199 NaN NaN NaN NaN NaN NaN NaN
property_type 199 5 Apartment 184 NaN NaN NaN NaN NaN NaN NaN
room_type 199 2 Entire home/apt 194 NaN NaN NaN NaN NaN NaN NaN
accommodates 199.000 NaN NaN NaN 4.683 1.597 1.000 4.000 4.000 6.000 16.000
availability_30 199.000 NaN NaN NaN 7.683 9.778 0.000 0.000 3.000 11.500 30.000
availability_60 199.000 NaN NaN NaN 19.382 20.845 0.000 0.000 12.000 33.000 60.000
availability_90 199.000 NaN NaN NaN 32.774 32.423 0.000 0.000 22.000 62.000 90.000
availability_365 199.000 NaN NaN NaN 146.417 136.954 0.000 0.000 121.000 277.000 365.000
bathrooms 199 4 1 142 NaN NaN NaN NaN NaN NaN NaN
bedrooms 199 1 2 199 NaN NaN NaN NaN NaN NaN NaN
bed_type 199 1 Real Bed 199 NaN NaN NaN NaN NaN NaN NaN
beds 198.000 NaN NaN NaN 2.470 0.938 1.000 2.000 2.000 3.000 6.000
square_feet 175.000 NaN NaN NaN 954.000 195.895 600.000 900.000 950.000 950.000 2000.000
price 199.000 NaN NaN NaN 347.196 219.317 65.000 202.500 299.000 405.000 1650.000
weekly_price 187.000 NaN NaN NaN 1268.083 655.247 685.000 999.500 1000.000 1300.000 5000.000
monthly_price 187.000 NaN NaN NaN 3942.436 1714.247 2200.000 3000.000 3375.000 4100.000 16000.000
security_deposit 199.000 NaN NaN NaN 523.028 638.456 95.000 250.000 300.000 500.000 5100.000
occupancy_30 199.000 NaN NaN NaN 74.377 32.615 0.000 61.500 90.000 100.000 100.000
occupancy_60 199.000 NaN NaN NaN 67.683 34.725 0.000 45.000 80.000 100.000 100.000
occupancy_90 199.000 NaN NaN NaN 63.583 36.036 0.000 31.000 76.000 100.000 100.000
occupancy_365 199.000 NaN NaN NaN 59.899 37.533 0.000 24.000 67.000 100.000 100.000
estimate_occupancy 199.000 NaN NaN NaN 79.673 28.644 0.000 73.000 90.000 100.000 100.000

3.2 Function to EDA Listings Data¶

In [12]:
def eda_listings(df = preprocess_listings_df, bar_plot = ['neighbourhood_cleansed','zipcode','bedrooms','bathrooms','property_type'], box_plot = ['neighbourhood_group_cleansed','neighbourhood_cleansed','zipcode','bedrooms','bathrooms','property_type']):
   
    for column in box_plot:  
        df_pivot = pd.pivot_table(df, values=['estimate_occupancy'], index=[column], aggfunc=('median',min,max,'count'), fill_value=0)
        df_pivot = df_pivot[df_pivot['estimate_occupancy']['count'] >= 5]
        print("Overall Occupancy Estimate:", np.median(df['estimate_occupancy']))
        print(df_pivot.reindex(df_pivot['estimate_occupancy'].sort_values(by='median', ascending=False).index))

        fig = px.box(df, y="estimate_occupancy", color=column, facet_col=column, boxmode="overlay", points='all', title="2-Bedroom Occupancy by "+column)
        for anno in fig['layout']['annotations']:
            anno['text']=''
        fig.show() 
        
    for column in box_plot:    
        fig = px.box(df, y="price", color=column, facet_col=column, boxmode="overlay", points='all', title="2-Bedroom Prices by "+column)
        for anno in fig['layout']['annotations']:
            anno['text']=''
        fig.show()    
    
    for column in bar_plot:
        groupby_df = df.groupby([column,'neighbourhood_group_cleansed'])['id'].count()\
                                .reset_index().sort_values(by=['id'], ascending=False).reset_index(drop = True)
        
        groupby_df[column] = groupby_df[column].astype(str)
        
        fig = px.bar(groupby_df.head(20), x=column, y="id", color='neighbourhood_group_cleansed', title="2-Bedroom Listings Count by "+column, labels={'id':'Count'})
        fig.show()
        
eda_listings()
Overall Occupancy Estimate: 90.0
                             estimate_occupancy                
                                          count  max median min
neighbourhood_group_cleansed                                   
Manhattan                                   192  100     90   0
Overall Occupancy Estimate: 90.0
                       estimate_occupancy                
                                    count  max median min
neighbourhood_cleansed                                   
Civic Center                            5  100     98   0
Chelsea                               131  100     93   0
Chinatown                              49  100     90   3
Battery Park City                       7  100     87   0
Overall Occupancy Estimate: 90.0
        estimate_occupancy                
                     count  max median min
zipcode                                   
10011                   81  100 94.000   0
10002                   32  100 90.000   3
10001                   42  100 89.500   0
10013                   16  100 88.000   0
10280                    5  100 87.000   0
10038                    6  100 85.500  30
Overall Occupancy Estimate: 90.0
         estimate_occupancy                
                      count  max median min
bedrooms                                   
2                       199  100     90   0
Overall Occupancy Estimate: 90.0
          estimate_occupancy                
                       count  max median min
bathrooms                                   
2                         54  100     94   0
1                        142  100     90   0
Overall Occupancy Estimate: 90.0
              estimate_occupancy                
                           count  max median min
property_type                                   
Apartment                    184  100 93.000   0
Loft                           8  100 52.500   3
  • We see 2-Bedroom apartments with 2-Bathrooms have a higher occupancy rate than 1-Bathrooms (94 versus 90). Hence, likely most residents are single professionals who prefer two bathrooms. The real estate company should consider this.
  • We see the real state company will be able to charge a much higher rent for 2-Bedroom 2-Bathroom apartments versus 2-Bedroom 1-Bathroom apartments (USD 482 versus USD 250).
  • Majority of the listings are in Chelsea, China Town and Battery Park. The real state company should focus here and check if any good deals available in these population centers within Manhattan.
  • The real estate company should focus on apartments as they have much higher occupancy than lofts, condos and house.

4. Preprocess & Analyze Home Value Data¶

4.1 Function to Preprocess Home Value Data¶

In [13]:
county_neighborood_mapping = preprocess_listings_df[['zipcode','neighbourhood_group_cleansed']].drop_duplicates().reset_index(drop=True)
county_neighborood_mapping.sort_values(by='zipcode').head()
Out[13]:
zipcode neighbourhood_group_cleansed
13 10001 Manhattan
8 10002 Manhattan
14 10003 Manhattan
3 10004 Manhattan
2 10006 Manhattan
In [14]:
home_value_df.head()
Out[14]:
RegionID RegionName City State Metro CountyName SizeRank 1996-04 1996-05 1996-06 1996-07 1996-08 1996-09 1996-10 1996-11 1996-12 1997-01 1997-02 1997-03 1997-04 1997-05 1997-06 1997-07 1997-08 1997-09 1997-10 1997-11 1997-12 1998-01 1998-02 1998-03 1998-04 1998-05 1998-06 1998-07 1998-08 1998-09 1998-10 1998-11 1998-12 1999-01 1999-02 1999-03 1999-04 1999-05 1999-06 1999-07 1999-08 1999-09 1999-10 1999-11 1999-12 2000-01 2000-02 2000-03 2000-04 2000-05 2000-06 2000-07 2000-08 2000-09 2000-10 2000-11 2000-12 2001-01 2001-02 2001-03 2001-04 2001-05 2001-06 2001-07 2001-08 2001-09 2001-10 2001-11 2001-12 2002-01 2002-02 2002-03 2002-04 2002-05 2002-06 2002-07 2002-08 2002-09 2002-10 2002-11 2002-12 2003-01 2003-02 2003-03 2003-04 2003-05 2003-06 2003-07 2003-08 2003-09 2003-10 2003-11 2003-12 2004-01 2004-02 2004-03 2004-04 2004-05 2004-06 2004-07 2004-08 2004-09 2004-10 2004-11 2004-12 2005-01 2005-02 2005-03 2005-04 2005-05 2005-06 2005-07 2005-08 2005-09 2005-10 2005-11 2005-12 2006-01 2006-02 2006-03 2006-04 2006-05 2006-06 2006-07 2006-08 2006-09 2006-10 2006-11 2006-12 2007-01 2007-02 2007-03 2007-04 2007-05 2007-06 2007-07 2007-08 2007-09 2007-10 2007-11 2007-12 2008-01 2008-02 2008-03 2008-04 2008-05 2008-06 2008-07 2008-08 2008-09 2008-10 2008-11 2008-12 2009-01 2009-02 2009-03 2009-04 2009-05 2009-06 2009-07 2009-08 2009-09 2009-10 2009-11 2009-12 2010-01 2010-02 2010-03 2010-04 2010-05 2010-06 2010-07 2010-08 2010-09 2010-10 2010-11 2010-12 2011-01 2011-02 2011-03 2011-04 2011-05 2011-06 2011-07 2011-08 2011-09 2011-10 2011-11 2011-12 2012-01 2012-02 2012-03 2012-04 2012-05 2012-06 2012-07 2012-08 2012-09 2012-10 2012-11 2012-12 2013-01 2013-02 2013-03 2013-04 2013-05 2013-06 2013-07 2013-08 2013-09 2013-10 2013-11 2013-12 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08 2014-09 2014-10 2014-11 2014-12 2015-01 2015-02 2015-03 2015-04 2015-05 2015-06 2015-07 2015-08 2015-09 2015-10 2015-11 2015-12 2016-01 2016-02 2016-03 2016-04 2016-05 2016-06 2016-07 2016-08 2016-09 2016-10 2016-11 2016-12 2017-01 2017-02 2017-03 2017-04 2017-05 2017-06
0 61627 10013 New York NY New York New York 1744 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1428000.000 1444100.000 1458700.000 1502700.000 1570100.000 1638000.000 1687000.000 1718300.000 1747600.000 1753300.000 1706600.000 1660500.000 1650500.000 1642200.000 1625600.000 1618600.000 1618900.000 1622200.000 1623500.000 1639500.000 1678900.000 1705500.000 1702200.000 1701700.000 1730800.000 1773400.000 1839600.000 1903600.000 1935000.000 1946100.000 1974100.000 1987800.000 1993400.000 1990200.000 1970600.000 1953100.000 1979000.000 2032800.000 2070000.000 2090200.000 2121100.000 2167300.000 2200000.000 2226600.000 2259300.000 2285700.000 2289000.000 2288500.000 2285800.000 2281300.000 2278800.000 2295100.000 2324500.000 2328500.000 2333000.000 2366700.000 2376200.000 2377400.000 2405100.000 2433000.000 2406100.000 2350800.000 2289500.000 2257800.000 2243500.000 2241000.000 2233600.000 2219900.000 2198000.000 2209300.000 2268700.000 2294100.000 2273600.000 2261700.000 2282800.000 2288000.000 2275700.000 2281300.000 2302800.000 2290900.000 2267800.000 2253600.000 2245300.000 2264200.000 2301000.000 2328600.000 2337000.000 2341500.000 2335200.000 2316700.000 2299900.000 2318700.000 2396400.000 2479200.000 2490300.000 2466000.000 2492400.000 2542800.000 2585500.000 2618100.000 2651700.000 2670800.000 2670000.000 2665400.000 2677400.000 2687000.000 2686200.000 2687700.000 2694200.000 2708600.000 2722800.000 2727900.000 2751300.000 2794600 2845700 2899700 2928600 2914100 2915200 2945600 2973800 2988900 3018400 3063600 3075900 3063800 3077400 3108800 3123300 3106400 3067600 3047400 3054500 3071700 3078300 3073100 3089400 3121700 3134200 3152400 3173600 3195000 3234600 3302400.000 3335800.000 3324200.000 3285100.000 3274100 3298600 3341100 3347100 3305500 3261100 3244000.000 3231400 3183300 3160200 3193500 3262200 3316500
1 61628 10014 New York NY New York New York 379 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1331700.000 1332100.000 1329200.000 1331500.000 1334300.000 1314100.000 1292300.000 1264600.000 1239900.000 1241400.000 1273900.000 1308200.000 1349300.000 1390900.000 1439000.000 1501900.000 1569000.000 1592600.000 1578800.000 1583800.000 1612400.000 1623900.000 1625500.000 1632400.000 1632000.000 1617600.000 1601500.000 1589100.000 1584000.000 1579500.000 1553800.000 1528600.000 1514500.000 1505800.000 1499800.000 1517000.000 1559000.000 1610300.000 1659600.000 1732200.000 1788500.000 1783100.000 1753200.000 1736200.000 1729400.000 1741500.000 1774500.000 1808400.000 1827500.000 1813800.000 1785500.000 1774800.000 1759800.000 1718200.000 1677300.000 1636900.000 1588800.000 1559300.000 1551900.000 1532400.000 1507000.000 1496600.000 1502400.000 1485700.000 1451800.000 1454400.000 1494500.000 1542200.000 1587300.000 1616000.000 1612300.000 1605800.000 1592500.000 1584200.000 1587600.000 1594300.000 1611700.000 1631800.000 1648800.000 1670500.000 1674300.000 1657100.000 1659200.000 1661800.000 1641600.000 1637800.000 1646600.000 1652100.000 1674800.000 1714300.000 1736800.000 1736800.000 1741200.000 1757700.000 1772600.000 1779300.000 1795000.000 1790300.000 1761100.000 1744400.000 1755400.000 1761200.000 1767000.000 1804300.000 1889100.000 1963800.000 2003800.000 2041600 2098500 2143400 2175600 2203100 2219000 2222700 2225400 2215900 2161700 2079100 2018200 1991700 1986200 2027900 2081200 2093200 2082800 2107300 2142400 2167100 2189400 2220100 2250300 2275700 2301600 2314700 2322000 2334500 2360300 2384700.000 2388200.000 2358300.000 2345800.000 2381700 2439700 2483000 2480800 2443200 2430100 2452900.000 2451200 2441900 2460900 2494900 2498400 2491600
2 61625 10011 New York NY New York New York 15 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1384000.000 1325500.000 1297400.000 1313500.000 1332400.000 1314500.000 1296700.000 1295100.000 1280000.000 1255300.000 1250800.000 1261200.000 1291100.000 1333900.000 1358500.000 1382600.000 1438800.000 1473100.000 1444500.000 1407100.000 1400300.000 1401600.000 1400500.000 1393700.000 1396100.000 1413200.000 1428200.000 1442400.000 1459000.000 1466500.000 1462000.000 1464300.000 1483300.000 1514600.000 1541900.000 1563500.000 1591300.000 1609900.000 1630300.000 1639200.000 1626700.000 1618900.000 1635400.000 1642000.000 1648300.000 1671100.000 1681400.000 1691600.000 1706200.000 1711100.000 1714100.000 1733400.000 1737700.000 1719900.000 1678200.000 1628400.000 1578300.000 1528400.000 1489200.000 1467000.000 1449400.000 1432100.000 1422700.000 1402200.000 1378600.000 1372300.000 1393100.000 1430000.000 1462100.000 1476800.000 1479000.000 1476600.000 1461200.000 1448300.000 1441600.000 1444300.000 1438600.000 1434100.000 1439300.000 1437700.000 1430300.000 1426800.000 1427800.000 1424600.000 1432800.000 1456500.000 1485100.000 1500200.000 1509600.000 1518500.000 1530800.000 1538000.000 1530500.000 1524500.000 1546500.000 1574800.000 1599600.000 1622500.000 1639000.000 1656100.000 1684600.000 1703000.000 1710000.000 1734300.000 1765200.000 1786000.000 1810700.000 1841500 1867600 1882200 1897000 1917300 1963400 1999200 2003500 2007900 2027700 2043500 2056300 2064500 2066000 2057900 2031300 1999000 1979200 1982900 2001600 2014700 2023500 2055300 2078300 2083600 2088800 2110600 2127500 2168900 2204700 2216100.000 2212500.000 2222600.000 2231900.000 2250800 2285200 2329100 2354000 2355500 2352200 2332100.000 2313300 2319600 2342100 2365900 2419700 2480400
3 97515 94024 Los Altos CA San Jose Santa Clara 4335 419500.000 422300.000 430400.000 440400.000 447100.000 447900.000 448000.000 450100.000 452200.000 456100.000 464600.000 470400.000 472300.000 475400.000 480000.000 483200.000 488900.000 496900.000 504700.000 511300.000 519500.000 525900.000 529600.000 535900.000 542300.000 543300.000 543300.000 550000.000 560000.000 571200.000 582300.000 589800.000 589300.000 582800.000 577600.000 578800.000 580000.000 576600.000 576800.000 585700.000 601700.000 624300.000 648100.000 665900.000 681700.000 699800.000 716400.000 734300.000 758500.000 784600.000 808700.000 831900.000 851100.000 865700.000 876800.000 886200.000 900200.000 920500.000 931300.000 929500.000 922500.000 916300.000 897300.000 865600.000 833500.000 811200.000 796100.000 794200.000 799500.000 798900.000 799900.000 813000.000 831300.000 850200.000 874000.000 891100.000 890900.000 883000.000 879700.000 880100.000 880100.000 877300.000 874100.000 873600.000 878500.000 884500.000 887100.000 886000.000 890700.000 901000.000 910100.000 913600.000 918000.000 927600.000 936400.000 935000.000 931100.000 933300.000 935100.000 940700.000 960100.000 987200.000 1007000.000 1013100.000 1013900.000 1026100.000 1048200.000 1062300.000 1074600.000 1087900.000 1109200.000 1136800.000 1157400.000 1152900.000 1143300.000 1143200.000 1147700.000 1150700.000 1161200.000 1179600.000 1187100.000 1184400.000 1176800.000 1174800.000 1180600.000 1188200.000 1197100.000 1216400.000 1235400.000 1241500.000 1240800.000 1245600.000 1256800.000 1257000.000 1247600.000 1241500.000 1227800.000 1207300.000 1197400.000 1206500.000 1212500.000 1209900.000 1206500.000 1202900.000 1189300.000 1188200.000 1192400.000 1175900.000 1149900.000 1141700.000 1136900.000 1119500.000 1099800.000 1089100.000 1080700.000 1071300.000 1057500.000 1044200.000 1039800.000 1054700.000 1083700.000 1112900.000 1126800.000 1127900.000 1147500.000 1187000.000 1215500.000 1222600.000 1234500.000 1239500.000 1234000.000 1227200.000 1229200.000 1230600.000 1230000.000 1225700.000 1212000.000 1189100.000 1165800.000 1149400.000 1141500.000 1150000.000 1173500.000 1192400.000 1195300.000 1199700.000 1212100.000 1223300.000 1223600.000 1218000.000 1223500.000 1242300.000 1260700.000 1282900.000 1314100.000 1350100.000 1384700.000 1415900.000 1446800.000 1477500.000 1503200.000 1516700.000 1525000.000 1542100.000 1569100.000 1586700.000 1593500.000 1601000.000 1604800 1597200 1593600 1609400 1635100 1653900 1654600 1648100 1660500 1684200 1711700 1743000 1766800 1788900 1831600 1878600 1906300 1927000 1968300 2039400 2105400 2146900 2168900 2181400 2194600 2216100 2234900 2246700 2266900 2311300 2368200.000 2401200.000 2396100.000 2373800.000 2365100 2357300 2348900 2330300 2302700 2269300 2263600.000 2279300 2302700 2321700 2351200 2371500 2378100
4 97519 94028 Portola Valley CA San Francisco San Mateo 7158 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 656200.000 666600.000 684300.000 690200.000 690000.000 679000.000 677500.000 688800.000 695600.000 685000.000 682000.000 689100.000 696500.000 701800.000 712800.000 715500.000 717600.000 737400.000 774600.000 803200.000 818900.000 830100.000 845900.000 865000.000 879600.000 892900.000 907400.000 943900.000 984200.000 1006400.000 1020200.000 1069100.000 1124300.000 1175800.000 1213300.000 1239600.000 1259700.000 1283300.000 1283900.000 1276400.000 1294200.000 1301300.000 1268400.000 1230500.000 1211900.000 1205600.000 1192000.000 1167200.000 1160500.000 1179600.000 1197500.000 1200100.000 1198800.000 1212100.000 1228200.000 1227500.000 1225700.000 1225800.000 1227200.000 1244100.000 1263600.000 1256000.000 1244500.000 1235400.000 1229000.000 1238400.000 1259900.000 1263400.000 1279800.000 1316500.000 1352400.000 1387300.000 1415100.000 1433200.000 1466300.000 1506100.000 1497400.000 1450100.000 1412900.000 1396400.000 1375900.000 1348800.000 1337100.000 1328500.000 1319300.000 1312500.000 1309800.000 1314200.000 1335500.000 1366900.000 1402100.000 1431400.000 1454800.000 1483600.000 1513500.000 1537200.000 1575200.000 1615200.000 1624200.000 1615300.000 1628900.000 1651300.000 1672600.000 1692800.000 1688900.000 1658700.000 1633400.000 1625900.000 1611900.000 1608200.000 1612100.000 1606200.000 1576700.000 1545300.000 1516800.000 1500600.000 1478600.000 1457600.000 1440300.000 1420400.000 1402500.000 1398300.000 1405600.000 1421800.000 1434400.000 1436100.000 1433000.000 1426700.000 1424300.000 1430200.000 1416500.000 1373700.000 1326800.000 1289800.000 1263600.000 1250600.000 1250300.000 1258400.000 1267400.000 1273800.000 1285000.000 1295100.000 1316800.000 1337700.000 1341400.000 1338200.000 1337600.000 1323600.000 1316300.000 1328100.000 1344400.000 1355900.000 1357100.000 1346500.000 1337300.000 1344500.000 1354600.000 1343400.000 1324300.000 1326500.000 1331400.000 1323300.000 1316200.000 1316700.000 1324000.000 1337600.000 1354000.000 1364000.000 1367900.000 1377100.000 1400100.000 1425600.000 1456100.000 1479700.000 1480400.000 1470700.000 1480200.000 1504100.000 1513100.000 1526200.000 1577300.000 1644900.000 1682400.000 1692400.000 1698600.000 1718800.000 1726100 1721200 1724300 1743800 1758500 1772300 1779700 1793800 1817500 1828100 1818900 1809100 1824900 1864200 1913900 1947900 1971100 1992700 2018200 2033500 2044100 2069100 2108600 2131000 2143000 2164400 2178700 2183900 2203200 2239000 2280600.000 2318900.000 2333600.000 2322300.000 2309700 2319500 2344800 2350800 2336500 2317200 2313800.000 2324200 2317300 2294500 2296300 2329400 2358300
In [15]:
home_value_melt = pd.melt(home_value_df, id_vars=['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName'], value_vars=home_value_df.columns[7:])
home_value_melt.rename(columns = {'variable': 'yyyy_mm', 'value': 'median_price_2bedroom'}, inplace=True)
home_value_melt['yyyy_mm'] = pd.to_datetime(home_value_melt['yyyy_mm'])
home_value_melt['year'] = home_value_melt['yyyy_mm'].dt.year
home_value_melt = home_value_melt[(home_value_melt['City'] == 'New York') & (home_value_melt['year'] == 2017)].reset_index(drop=True)
home_value_melt.head()
print(len(home_value_melt))
np.unique(home_value_melt['RegionName'])
150
Out[15]:
array([10003, 10011, 10013, 10014, 10021, 10022, 10023, 10025, 10028,
       10036, 10128, 10303, 10304, 10305, 10306, 10308, 10309, 10312,
       10314, 11201, 11215, 11217, 11231, 11234, 11434], dtype=int64)
In [16]:
home_value_melt['RegionName'] = home_value_melt['RegionName'].astype(str)
home_value_melt = pd.merge(home_value_melt, county_neighborood_mapping, left_on = 'RegionName', right_on = 'zipcode', how = 'left')
print(len(home_value_melt))
home_value_melt.sort_values(by=['RegionName']).head(30)
150
Out[16]:
RegionID RegionName City State Metro CountyName yyyy_mm median_price_2bedroom year zipcode neighbourhood_group_cleansed
128 61617 10003 New York NY New York New York 2017-06-01 2147000.000 2017 10003 Manhattan
3 61617 10003 New York NY New York New York 2017-01-01 1915700.000 2017 10003 Manhattan
78 61617 10003 New York NY New York New York 2017-04-01 2045300.000 2017 10003 Manhattan
53 61617 10003 New York NY New York New York 2017-03-01 1965700.000 2017 10003 Manhattan
103 61617 10003 New York NY New York New York 2017-05-01 2109100.000 2017 10003 Manhattan
28 61617 10003 New York NY New York New York 2017-02-01 1916500.000 2017 10003 Manhattan
77 61625 10011 New York NY New York New York 2017-04-01 2365900.000 2017 10011 Manhattan
2 61625 10011 New York NY New York New York 2017-01-01 2313300.000 2017 10011 Manhattan
52 61625 10011 New York NY New York New York 2017-03-01 2342100.000 2017 10011 Manhattan
127 61625 10011 New York NY New York New York 2017-06-01 2480400.000 2017 10011 Manhattan
102 61625 10011 New York NY New York New York 2017-05-01 2419700.000 2017 10011 Manhattan
27 61625 10011 New York NY New York New York 2017-02-01 2319600.000 2017 10011 Manhattan
125 61627 10013 New York NY New York New York 2017-06-01 3316500.000 2017 10013 Manhattan
0 61627 10013 New York NY New York New York 2017-01-01 3231400.000 2017 10013 Manhattan
50 61627 10013 New York NY New York New York 2017-03-01 3160200.000 2017 10013 Manhattan
25 61627 10013 New York NY New York New York 2017-02-01 3183300.000 2017 10013 Manhattan
100 61627 10013 New York NY New York New York 2017-05-01 3262200.000 2017 10013 Manhattan
75 61627 10013 New York NY New York New York 2017-04-01 3193500.000 2017 10013 Manhattan
51 61628 10014 New York NY New York New York 2017-03-01 2460900.000 2017 10014 Manhattan
76 61628 10014 New York NY New York New York 2017-04-01 2494900.000 2017 10014 Manhattan
126 61628 10014 New York NY New York New York 2017-06-01 2491600.000 2017 10014 Manhattan
1 61628 10014 New York NY New York New York 2017-01-01 2451200.000 2017 10014 Manhattan
101 61628 10014 New York NY New York New York 2017-05-01 2498400.000 2017 10014 Manhattan
26 61628 10014 New York NY New York New York 2017-02-01 2441900.000 2017 10014 Manhattan
32 61635 10021 New York NY New York New York 2017-02-01 1649800.000 2017 NaN NaN
7 61635 10021 New York NY New York New York 2017-01-01 1643700.000 2017 NaN NaN
82 61635 10021 New York NY New York New York 2017-04-01 1731400.000 2017 NaN NaN
132 61635 10021 New York NY New York New York 2017-06-01 1815600.000 2017 NaN NaN
57 61635 10021 New York NY New York New York 2017-03-01 1688500.000 2017 NaN NaN
107 61635 10021 New York NY New York New York 2017-05-01 1774100.000 2017 NaN NaN
In [17]:
home_value_melt.describe(include = 'all').T
Out[17]:
count unique top freq first last mean std min 25% 50% 75% max
RegionID 150.000 NaN NaN NaN NaT NaT 61783.160 161.822 61617.000 61637.000 61780.000 61790.000 62120.000
RegionName 150 25 11234 6 NaT NaT NaN NaN NaN NaN NaN NaN NaN
City 150 1 New York 150 NaT NaT NaN NaN NaN NaN NaN NaN NaN
State 150 1 NY 150 NaT NaT NaN NaN NaN NaN NaN NaN NaN
Metro 150 1 New York 150 NaT NaT NaN NaN NaN NaN NaN NaN NaN
CountyName 150 4 New York 66 NaT NaT NaN NaN NaN NaN NaN NaN NaN
yyyy_mm 150 6 2017-06-01 00:00:00 25 2017-01-01 2017-06-01 NaN NaN NaN NaN NaN NaN NaN
median_price_2bedroom 150.000 NaN NaN NaN NaT NaT 1236808.000 832658.825 299000.000 377125.000 1231850.000 1814875.000 3316500.000
year 150.000 NaN NaN NaN NaT NaT 2017.000 0.000 2017.000 2017.000 2017.000 2017.000 2017.000
zipcode 24 4 10003 6 NaT NaT NaN NaN NaN NaN NaN NaN NaN
neighbourhood_group_cleansed 24 1 Manhattan 24 NaT NaT NaN NaN NaN NaN NaN NaN NaN
In [18]:
#Impute missing neighborhood name
preprocess_listings_df['int_zipcode'] = preprocess_listings_df['zipcode'].astype(int)
home_value_melt['int_zipcode'] = home_value_melt['RegionName'].astype(int)

impute_zip = preprocess_listings_df[['neighbourhood_group_cleansed','int_zipcode']].groupby(['neighbourhood_group_cleansed']).median().reset_index()
impute_zip = impute_zip.rename(columns={"neighbourhood_group_cleansed": "impute_neighbourhood_group_cleansed", "int_zipcode": "median_zipcode"})
#print(impute_zip)
home_value_melt2 = home_value_melt.merge(impute_zip, how='cross')
#print(len(home_value_melt2))
home_value_melt2['zipp_diff'] = np.abs(home_value_melt2['int_zipcode'] - home_value_melt2['median_zipcode'])
#print(len(home_value_melt2))
home_value_melt2['zipp_diff_rank'] = home_value_melt2[['RegionName','yyyy_mm','zipp_diff']].groupby(['RegionName','yyyy_mm'])['zipp_diff'].rank(method='first')
#print(len(home_value_melt2))

#print(home_value_melt[['RegionName','neighbourhood_group_cleansed','int_zipcode']].sort_values(by=['RegionName']).head(30))
#print(home_value_melt2[['RegionName','neighbourhood_group_cleansed','int_zipcode','zipp_diff_rank','median_zipcode']].sort_values(by=['RegionName']).head(30))

home_value_melt2 = home_value_melt2[home_value_melt2['zipp_diff_rank'] == 1]
#print(len(home_value_melt2))
home_value_melt2.loc[home_value_melt2['neighbourhood_group_cleansed'].isnull(), 'neighbourhood_group_cleansed'] = home_value_melt2.loc[home_value_melt2['neighbourhood_group_cleansed'].isnull(), 'impute_neighbourhood_group_cleansed']
home_value_melt2.drop(['impute_neighbourhood_group_cleansed','int_zipcode','median_zipcode','zipp_diff','zipp_diff_rank'], axis=1, inplace=True)
#print(len(home_value_melt2))

4.2 Function to EDA Home Value Data¶

In [19]:
def eda_home_value(df = home_value_melt2, bar_plot = ['RegionName'], box_plot = ['CountyName']):
    
    for column in box_plot:
        fig = px.box(df, y="median_price_2bedroom", color=column, facet_col=column, boxmode="overlay", points='all', title="2-Bedroom Price by "+column)
        fig.show()
    
    for column in bar_plot:
        
        groupby_df = df.groupby([column,'CountyName'])['median_price_2bedroom'].median()\
                                .reset_index().sort_values(by=['median_price_2bedroom'], ascending=False).reset_index(drop = True)
        
        groupby_df[column] = groupby_df[column].astype(str)

        fig = px.bar(groupby_df.head(20), x=column, y="median_price_2bedroom", title="Median 2-Bedroom Price by "+column, color='CountyName', labels={'median_price_2bedroom':'Count'})
        fig.show()
        
eda_home_value()
In [20]:
#home_value_melt2.sort_values(by=['RegionName','yyyy_mm'])
In [21]:
#np.unique(home_value_melt2['RegionName'])
In [22]:
zip_code_2017_prices = home_value_melt2.drop(['yyyy_mm'],axis=1).groupby(['RegionName','RegionID','year'])['median_price_2bedroom'].median().reset_index()
In [23]:
zip_code_2017_prices.head()
Out[23]:
RegionName RegionID year median_price_2bedroom
0 10003 61617 2017 2005500.000
1 10011 61625 2017 2354000.000
2 10013 61627 2017 3212450.000
3 10014 61628 2017 2476250.000
4 10021 61635 2017 1709950.000

5. Profit Analysis for Different Time Horizons¶

In [24]:
preprocess_listings_df['zipcode'] = preprocess_listings_df['zipcode'].astype(str)
print(len(preprocess_listings_df))
home_value_melt['RegionName'] = home_value_melt['RegionName'].astype(str)
merge_listings_homevalue = pd.merge(preprocess_listings_df, zip_code_2017_prices, left_on = 'zipcode', right_on = 'RegionName', how = 'left')
print(len(merge_listings_homevalue))
merge_listings_homevalue.head()
199
199
Out[24]:
id last_scraped host_id neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed market city zipcode state property_type room_type accommodates availability_30 availability_60 availability_90 availability_365 bathrooms bedrooms bed_type beds square_feet price weekly_price monthly_price security_deposit occupancy_30 occupancy_60 occupancy_90 occupancy_365 estimate_occupancy int_zipcode RegionName RegionID year median_price_2bedroom
0 4357134 5/5/17 7809661 Astoria Ditmars Steinway Queens New York New York 11102 NY Apartment Entire home/apt 4 0 0 0 0 1 2 Real Bed 2.000 950.000 200.000 999.500 3000.000 250.000 100.000 100.000 100.000 100.000 100.000 11102 NaN NaN NaN NaN
1 13612947 5/5/17 76332312 NaN Astoria Queens New York New York 11106 NY Apartment Entire home/apt 4 29 59 89 364 1 2 Real Bed 3.000 900.000 150.000 1000.000 3375.000 100.000 3.000 2.000 1.000 0.000 3.000 11106 NaN NaN NaN NaN
2 18430507 5/3/17 127732143 Columbia Street Waterfront Astoria Queens New York New York 11102 NY Apartment Entire home/apt 4 29 59 89 89 2 2 Real Bed 2.000 1300.000 290.000 1300.000 4647.500 500.000 3.000 2.000 1.000 76.000 76.000 11102 NaN NaN NaN NaN
3 294239 5/5/17 1521432 Astoria Astoria Queens New York New York 11106 NY Apartment Entire home/apt 6 0 5 20 20 1 2 Real Bed 2.000 950.000 200.000 1050.000 3000.000 250.000 100.000 92.000 78.000 95.000 100.000 11106 NaN NaN NaN NaN
4 9944426 5/4/17 51105827 Battery Park City Battery Park City Manhattan New York New York 10006 NY Apartment Entire home/apt 5 27 57 87 362 1 2 Real Bed 2.000 950.000 225.000 999.500 3000.000 250.000 10.000 5.000 3.000 1.000 10.000 10006 NaN NaN NaN NaN
In [25]:
## Impute Missing Prices as not all ZipCodes in Listings data for NYC exists in Zillows 2-bedroom cost data
print("Total 2-Bedroom Listings: ",len(merge_listings_homevalue))
print("2-Bedroom Listings with Mssing Price: ", sum(merge_listings_homevalue['RegionName'].isnull()))
Total 2-Bedroom Listings:  199
2-Bedroom Listings with Mssing Price:  97
In [26]:
impute_propoerty_cost = merge_listings_homevalue[~merge_listings_homevalue['median_price_2bedroom'].isnull()][['median_price_2bedroom','property_type']].groupby(['property_type']).median().reset_index()
impute_propoerty_cost.rename(columns={'median_price_2bedroom':'impute_median_price_2bedroom'}, inplace=True)
impute_propoerty_cost
Out[26]:
property_type impute_median_price_2bedroom
0 Apartment 2354000.000
1 Condominium 2005500.000
2 House 2354000.000
3 Loft 2415125.000
4 Townhouse 2354000.000
In [27]:
merge_listings_homevalue = pd.merge(merge_listings_homevalue, impute_propoerty_cost, on = 'property_type', how = 'left')
merge_listings_homevalue.loc[merge_listings_homevalue['median_price_2bedroom'].isnull(), 'median_price_2bedroom'] = merge_listings_homevalue.loc[merge_listings_homevalue['median_price_2bedroom'].isnull(), 'impute_median_price_2bedroom']
merge_listings_homevalue.head()
Out[27]:
id last_scraped host_id neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed market city zipcode state property_type room_type accommodates availability_30 availability_60 availability_90 availability_365 bathrooms bedrooms bed_type beds square_feet price weekly_price monthly_price security_deposit occupancy_30 occupancy_60 occupancy_90 occupancy_365 estimate_occupancy int_zipcode RegionName RegionID year median_price_2bedroom impute_median_price_2bedroom
0 4357134 5/5/17 7809661 Astoria Ditmars Steinway Queens New York New York 11102 NY Apartment Entire home/apt 4 0 0 0 0 1 2 Real Bed 2.000 950.000 200.000 999.500 3000.000 250.000 100.000 100.000 100.000 100.000 100.000 11102 NaN NaN NaN 2354000.000 2354000.000
1 13612947 5/5/17 76332312 NaN Astoria Queens New York New York 11106 NY Apartment Entire home/apt 4 29 59 89 364 1 2 Real Bed 3.000 900.000 150.000 1000.000 3375.000 100.000 3.000 2.000 1.000 0.000 3.000 11106 NaN NaN NaN 2354000.000 2354000.000
2 18430507 5/3/17 127732143 Columbia Street Waterfront Astoria Queens New York New York 11102 NY Apartment Entire home/apt 4 29 59 89 89 2 2 Real Bed 2.000 1300.000 290.000 1300.000 4647.500 500.000 3.000 2.000 1.000 76.000 76.000 11102 NaN NaN NaN 2354000.000 2354000.000
3 294239 5/5/17 1521432 Astoria Astoria Queens New York New York 11106 NY Apartment Entire home/apt 6 0 5 20 20 1 2 Real Bed 2.000 950.000 200.000 1050.000 3000.000 250.000 100.000 92.000 78.000 95.000 100.000 11106 NaN NaN NaN 2354000.000 2354000.000
4 9944426 5/4/17 51105827 Battery Park City Battery Park City Manhattan New York New York 10006 NY Apartment Entire home/apt 5 27 57 87 362 1 2 Real Bed 2.000 950.000 225.000 999.500 3000.000 250.000 10.000 5.000 3.000 1.000 10.000 10006 NaN NaN NaN 2354000.000 2354000.000
In [28]:
merge_listings_homevalue['break_even_year'] = np.round(merge_listings_homevalue['median_price_2bedroom']/(365*merge_listings_homevalue['price']*merge_listings_homevalue['estimate_occupancy']/100),0)
merge_listings_homevalue['25_year_profit'] = np.round(25*365*merge_listings_homevalue['price']*merge_listings_homevalue['estimate_occupancy']/100 - merge_listings_homevalue['median_price_2bedroom'],0)
merge_listings_homevalue['30_year_profit'] = np.round(30*365*merge_listings_homevalue['price']*merge_listings_homevalue['estimate_occupancy']/100 - merge_listings_homevalue['median_price_2bedroom'],0)
merge_listings_homevalue['35_year_profit'] = np.round(35*365*merge_listings_homevalue['price']*merge_listings_homevalue['estimate_occupancy']/100 - merge_listings_homevalue['median_price_2bedroom'],0)
merge_listings_homevalue['40_year_profit'] = np.round(40*365*merge_listings_homevalue['price']*merge_listings_homevalue['estimate_occupancy']/100 - merge_listings_homevalue['median_price_2bedroom'],0)
merge_listings_homevalue['45_year_profit'] = np.round(45*365*merge_listings_homevalue['price']*merge_listings_homevalue['estimate_occupancy']/100 - merge_listings_homevalue['median_price_2bedroom'],0)
merge_listings_homevalue['50_year_profit'] = np.round(50*365*merge_listings_homevalue['price']*merge_listings_homevalue['estimate_occupancy']/100 - merge_listings_homevalue['median_price_2bedroom'],0)
merge_listings_homevalue.head()
Out[28]:
id last_scraped host_id neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed market city zipcode state property_type room_type accommodates availability_30 availability_60 availability_90 availability_365 bathrooms bedrooms bed_type beds square_feet price weekly_price monthly_price security_deposit occupancy_30 occupancy_60 occupancy_90 occupancy_365 estimate_occupancy int_zipcode RegionName RegionID year median_price_2bedroom impute_median_price_2bedroom break_even_year 25_year_profit 30_year_profit 35_year_profit 40_year_profit 45_year_profit 50_year_profit
0 4357134 5/5/17 7809661 Astoria Ditmars Steinway Queens New York New York 11102 NY Apartment Entire home/apt 4 0 0 0 0 1 2 Real Bed 2.000 950.000 200.000 999.500 3000.000 250.000 100.000 100.000 100.000 100.000 100.000 11102 NaN NaN NaN 2354000.000 2354000.000 32.000 -529000.000 -164000.000 201000.000 566000.000 931000.000 1296000.000
1 13612947 5/5/17 76332312 NaN Astoria Queens New York New York 11106 NY Apartment Entire home/apt 4 29 59 89 364 1 2 Real Bed 3.000 900.000 150.000 1000.000 3375.000 100.000 3.000 2.000 1.000 0.000 3.000 11106 NaN NaN NaN 2354000.000 2354000.000 1433.000 -2312938.000 -2304725.000 -2296512.000 -2288300.000 -2280088.000 -2271875.000
2 18430507 5/3/17 127732143 Columbia Street Waterfront Astoria Queens New York New York 11102 NY Apartment Entire home/apt 4 29 59 89 89 2 2 Real Bed 2.000 1300.000 290.000 1300.000 4647.500 500.000 3.000 2.000 1.000 76.000 76.000 11102 NaN NaN NaN 2354000.000 2354000.000 29.000 -342850.000 59380.000 461610.000 863840.000 1266070.000 1668300.000
3 294239 5/5/17 1521432 Astoria Astoria Queens New York New York 11106 NY Apartment Entire home/apt 6 0 5 20 20 1 2 Real Bed 2.000 950.000 200.000 1050.000 3000.000 250.000 100.000 92.000 78.000 95.000 100.000 11106 NaN NaN NaN 2354000.000 2354000.000 32.000 -529000.000 -164000.000 201000.000 566000.000 931000.000 1296000.000
4 9944426 5/4/17 51105827 Battery Park City Battery Park City Manhattan New York New York 10006 NY Apartment Entire home/apt 5 27 57 87 362 1 2 Real Bed 2.000 950.000 225.000 999.500 3000.000 250.000 10.000 5.000 3.000 1.000 10.000 10006 NaN NaN NaN 2354000.000 2354000.000 287.000 -2148688.000 -2107625.000 -2066562.000 -2025500.000 -1984438.000 -1943375.000

5.1 Function to analyze profit estimations for various time horizons¶

In [29]:
def eda_profit(df = merge_listings_homevalue, bar_plot = ['break_even_year','30_year_profit','40_year_profit'], box_plot = ['break_even_year','30_year_profit','40_year_profit'], max_break_even_horizon = 100):

    for column in box_plot:  
        df_pivot = pd.pivot_table(df[df['break_even_year'] < max_break_even_horizon], values=[column], index=['zipcode'], aggfunc=('median',min,max), fill_value=0)
        df_pivot = df_pivot[df_pivot[column]['min'] > 0]
        print(df_pivot.reindex(df_pivot[column].sort_values(by='median', ascending=False).index))
        fig = px.box(df[df['break_even_year'] < max_break_even_horizon], y=column, color="zipcode", facet_col="zipcode", boxmode="overlay", points='all', title=column+"by zipcode")
        for anno in fig['layout']['annotations']:
            anno['text']=''        
        fig.show()
    
    for column in bar_plot:             
        groupby_df = df[df['break_even_year'] < max_break_even_horizon].groupby(["zipcode","neighbourhood_group_cleansed"])[column].median()\
                                .reset_index().sort_values(by=[column], ascending=False).reset_index(drop = True)
        
        #groupby_df[column] = groupby_df[column].astype(str)

        fig = px.bar(groupby_df.head(20), x="zipcode", y=column, title=column+"by zipcode", color="neighbourhood_group_cleansed")
        fig.show()
        
eda_profit()
        break_even_year           
                    max median min
zipcode                           
11221                53 53.000  53
10013                88 47.000  15
11238                46 46.000  46
10038                88 35.000  27
10002                73 34.500  12
11106                32 32.000  32
11102                32 30.500  29
10011                98 26.000   7
10280                32 26.000  14
10014                28 25.500  23
10001                86 22.000   6
10004                21 21.000  21
10010                24 21.000   6
10003                11  8.000   7
        30_year_profit                     
                   max      median      min
zipcode                                    
10003          6754500 5205075.000  3458550
10004           931000  931000.000   931000
10010          9691000  931000.000   602500
10014           808750  472037.500   135325
        40_year_profit                  
                   max   median      min
zipcode                                 
10003          9674500  7608600  5279900
10004          2026000  2026000  2026000
10010         13706000  2026000  1588000
10014          1903750  1454800  1005850
10280          4216000  1325565   566000
11102           863840   714920   566000
11106           566000   566000   566000
  • We see Zip Codes in Mahattan have the smallest break even horizon.
  • For a guaranteed positive ROI over a 30 year or 40 year horizon, only invest in Manhattan zip codes.
  • If some risk is acceptable, there is also a high likelihood of positive ROI in couple of Queens neighborhood over 40 year horizon which can help the real estate company establish presence across NYC than just narrowing to Manhattan. Could be beneficial as more folks are moving from city center to ouskirts. Also, property prices may rise more sharply on the outskirts since city centers are already saturated and virtual work from home trend is rising.

6. Conclusion¶

6.1 Insights¶

1. What's the equation of Profit?¶

Profit = (Number of Years) x 365 x (Avg. Listing Price for One Day Booking) x (Estimate of Occupancy Rate) - (Estimate of Property Cost of a 2 Bedroom in the respective zip code)¶

2. What's the Occupancy Rate for the listings?¶

Overall Occupancy Estimate: 90.0¶

Occupancy by zipcode:¶

max median min
zipcode
10003 100 100.000 75
10004 100 100.000 100
10010 100 100.000 90
11238 100 100.000 100
10011 100 94.000 0
10002 100 90.000 3
11221 90 90.000 90
10001 100 89.500 0
10013 100 88.000 0
11102 100 88.000 76
10280 100 87.000 0
10038 100 85.500 30
10014 100 76.500 53
11106 100 51.500 3
10006 10 10.000 10
11226 10 10.000 10

3. Which properties have the least breakeven years for initial capital investment?¶

The zipcodes [10003,10004,10010] have the least break_even_year¶

max median min
zipcode
10003 11 8.000 7
10004 21 21.000 21
10010 24 21.000 6
10001 86 22.000 6
10014 28 25.500 23
10011 98 26.000 7
10280 32 26.000 14
11102 32 30.500 29

4. What zip codes to buy for 30 years investment horizon?¶

ROI in USD¶

Filtering to Zip Codes with non-negative ROI¶

max median min
zipcode
10003 6754500 5205075.000 3458550
10010 9691000 931000.000 602500
10004 931000 931000.000 931000
10014 808750 472037.500 135325

5. What zip codes to buy for 40 years investment horizon?¶

ROI in USD¶

Filtering to Zip Codes with non-negative ROI¶

max median min
zipcode
10003 9674500 7608600 5279900
10010 13706000 2026000 1588000
10004 2026000 2026000 2026000
10014 1903750 1454800 1005850
10280 4216000 1325565 566000
11102 863840 714920 566000
11106 566000 566000 566000

6. Final Recommendation¶

The real estate company should focus on finding good deals and buy 2-Bedroom 2-Bathroom with Apartment Property Types in following zip codes (within the neighborhoods of Chelsea, China Town and Astoria):¶

(A) For 30-year investment horizon¶
  • Manhattan Zip Codes: 10003, 10010, 10004
(B) For 40-year investment horizon¶
  • Manhattan Zip Codes: 10003, 10010, 10004
  • Queens ZIp Codes: 11102

6.2 Data Quality Checks & Derived Fields¶

Listings Data¶

  • 'square_feet': Impute missing values by median square_feet within grouping by ('bathrooms','bedrooms','beds','property_type')
  • 'price': Remove the currency prefix and convert to numeric
  • 'weekly_price': Remove the currency prefix and convert to numeric
  • 'monthly_price': Remove the currency prefix and convert to numeric
  • 'security_deposit': Remove the currency prefix and convert to numeric
  • 'zipcode': Remove decimals, hyphens, suffix and just keep the zipcode. Convert back to string and treat as category
  • 'host_id': Convert to string and treat as category
  • 'bathrooms': Convert to string and treat as category
  • 'bedrooms': Convert to string and treat as category

Home Value Data¶

  • Transform the data from wide to long and filter for the max year i.e. 2017
  • Then take median of property prices across the dates for every RegionName (i.e. zipcode)

Outliers¶

  • Use Median for aggregation instead of Mean to minimize the effect of noisy outliers in both datasets
  • Remove listings with estimated breakeven years > 100 i.e. exclude noisy data

Derived Fields¶

  • 'occupancy_30': (30 - 'availability_30')*100 / 30
  • 'occupancy_60': (60 - 'availability_60')*100 / 60
  • 'occupancy_90': (90 - 'availability_90')*100 / 90
  • 'occupancy_365': (365 - 'availability_365')*100 / 365
  • 'estimate_occupancy': maxmimum among 'occupancy_30', 'occupancy_60', 'occupancy_90', 'occupancy_365'

6.3 Methodology¶

Exhaustive feature processing:¶

  • Irrespective of whether a feature is used for this particular analysis the methodology has pre-processed and clean all key features which could be used for follow-up analysis

Approach:¶

  • Calculate an estimate of revenue for every listing = estimate of daily price number of years estimate of occupancy rate
  • Calculate upfront property costs for every listing = estimate of 2-bedroom property costs
  • Calculate profits for every listing = estimate of revenue - estimate of costs

Automation:¶

  • Using functions with arguments to extend the
    1. Raw Data loading AND
    1. preprocessing, eda and visualization

6.4 Next Steps¶

  • Extract features from text like amenities like 'ac', 'balcony' to further understand the value of property
  • Build a regression model to forecast the 365 days occupancy of a listing using features like neighborhood, #bathrooms, square feet, review count, review sentiments, amenities, poplation size rank
  • Build an interactive Flask dashboard given more time