Abstract
The following document is the analysis of the AirBnB and Zillow data, aimed at providing recommendations to client on the best investment regions for short-time rental listing.
This project aims at looking at the cost and revenue data of houses in New York, aimed at recommending key areas of profitability. Data is sourced from two key locations for the same purpose. The idea would be to understand the key features of profitability of a house by identifying the various features, and digging into the
The problem aims at consulting for a real estate company that is looking to invest in two bedroom properties in New York City. The company has a niche of purchasing the properties and renting it out on a short-term basis. The particular ask to identify which zip codes are the best to invest in.
This solution aims at digging deep into the available data sources, finding the key variables/metrics that can be used to best define and describe the neighourhood and help calculate the key performance indices (like revenue, reviews and so on). The approach splits the main ask into a number of sub-questions in order to tackle the problem and the analyses on the individual tasks aim to provide a holistic solution.
The key ask - “What are the best zip codes in New York City to invest in a 2-bed room apartment for short-term rentals?” is broken down as such into certain key questions aimed at providing a specific perspective.
The analyses uses the AirBnB listing data (Revenue) and Zillow data (cost). The Analysis is seperated into the following key sections
#List of required (CRAN) packages
pkgs <- c("data.table", "Hmisc", "plotly", "dplyr", "tidyr")
#Install required packages
for(pkg in pkgs){
if(!(pkg %in% installed.packages()[,"Package"])){
install.packages(pkg)
}
}
require(pkgs)
lapply(pkgs, require, character.only = TRUE)
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
The datasets are loaded into R based on the file name. There are two datasets that are available for this particular analysis -
The datasets that have been loaded are verified and explored to understand the dimensions, the data format and the values present in the columns with a key objective of cleaning and extracting the apt data for further analyses.
#Check the number of records and variables
dim(airbnb_data_raw)
## [1] 48895 106
The AirBnB data consists of 106 variables and 48895 listings present in it.
#Check for duplicates in the data
airbnb_data_raw[which(duplicated(airbnb_data_raw) == T),]
airbnb_data_raw[which(duplicated(airbnb_data_raw$id) == T),]
There are no overall duplicates in the data, nor with the IDs. Each record could be uniquely identified using the ID column.
numMissing <- contents(airbnb_data_raw, sort = c("NAs"))$contents
table(numMissing$Storage)
##
## character double integer logical
## 64 7 32 3
A majority of the columns are characters, containing the description and the location of the listing. A brief view of the same is giving below.
numMissing[order(-numMissing$NAs),]
A view of the integer columns are present below.
intNumMissing <- numMissing[which(numMissing$Storage %in% c("integer", "numeric", "double")),]
intNumMissing[order(-intNumMissing$NAs),]
A lot of the description data regarding the host and the property listing is not necessary for preliminary analysis. This would have to be removed. There are also a lot of NA(missing values) in the columns, a summary of which is present in the above Views.
A glimpse of the price column suggests that there is a preceding $, that renders the column as a character. This character is removed and the resultant value is formated as a numeric value.
airbnb_data_raw$price_cleaned <- as.numeric(gsub('[$,]','',airbnb_data_raw$price))
summary(airbnb_data_raw$price_cleaned)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 69.0 106.0 152.7 175.0 10000.0
Checking the summary and spread of the price variables, we can sense that a few outliers might be present (Eg. Price 0 and Price 10000, which seems exorbidant).
A glimpse of the zipcode column suggests that a few values are in the 5-4 zip format, the hyphen causing it to be rendered as a character column. This is fine because they should not be treated as an integer even though they are made of numbers. Splitting zipcodes into 2 columns - the 5 number zip and the additional 4 digit code
#Using the separate function to create the "zipcode_9" columns
airbnb_data_raw <- airbnb_data_raw %>% separate(zipcode, c("zipcode","zipcode_9"),"-")
It is noticable that there are zipcodes with NA values (as shown in the above tables). Each listing needs to have an identifiable zipcode as this is pivotal to the analysis. This is fixed by matching the latitude-longitude of these missing records, and extracting the closest zipcode from the rest of the data (latitude-longitude rounded up to 2 digits).
The first step is to create a zipcode lookup table for the latitude-longitudes.
zipcodes_lookup <- airbnb_data_raw[,c("latitude", "longitude", "zipcode")]
zipcodes_lookup$latitude <- round(zipcodes_lookup$latitude, 2)
zipcodes_lookup$longitude <- round(zipcodes_lookup$longitude, 2)
zipcodes_lookup <- zipcodes_lookup[!duplicated(zipcodes_lookup$latitude, zipcodes_lookup$longitude),]
names(zipcodes_lookup)[which(names(zipcodes_lookup)=="zipcode")] <- "zipcode_new"
Once we have created a zipcode lookup table, we impute the zipcodes for the missing records.
airbnb_data_raw$latitude_round <- round(airbnb_data_raw$latitude, 2)
airbnb_data_raw$longitude_round <- round(airbnb_data_raw$longitude, 2)
airbnb_data_raw <- merge(airbnb_data_raw, zipcodes_lookup, by.x = c("latitude_round", "longitude_round"), by.y = c("latitude", "longitude"), all.x = T)
airbnb_data_raw$zipcode_2 <- airbnb_data_raw$zipcode
airbnb_data_raw$zipcode_2[which(is.na(airbnb_data_raw$zipcode_2))] <- airbnb_data_raw$zipcode_new[which(is.na(airbnb_data_raw$zipcode_2))]
There are a few columns that help identify the location of the particular listing. The columns are the street, neighbourhood, neighbourhood_cleansed, neighbourhood_group_cleansed, city, state, market, smart_location, country_code and country. Since all of these are character columns, the aim is to identify the apt variable (or combination of variables) that would provide sufficient depth for drill down of our analysis. A summary of this is present below
neighbourhood_uniques <- data.table(column = names(sapply(airbnb_data_raw[,c("street", "neighbourhood", "neighbourhood_cleansed", "neighbourhood_group_cleansed", "city", "state" , "market", "smart_location", "country_code", "country")], function(x) length(unique(x)))), uniqueValue = sapply(airbnb_data_raw[,c("street", "neighbourhood", "neighbourhood_cleansed", "neighbourhood_group_cleansed", "city", "state" , "market", "smart_location", "country_code", "country")], function(x) length(unique(x))))
neighbourhood_uniques
A brief summary of the variables are as follows-
Based on the above data exploration, a couple of key columns are extracted for EDA and deep dives from the AirBnB dataset
The variables that are chosen for further analyses are as follows.
contents(airbnb_data_extract)$contents
The client is interested in 2 bedroom apartments. A quick view of the spread across bedrooms is as below.
plot_ly(x = as.data.frame(table(airbnb_data_extract$bedrooms))$Var1, y = as.data.frame(table(airbnb_data_extract$bedrooms))$Freq, color = as.data.frame(table(airbnb_data_extract$bedrooms))$Var1, type = "bar", showlegend = FALSE) %>% layout(title = "Frequency Distributon of Listings Across Bedrooms", xaxis = list(title = "Bedrooms"), yaxis = list(title = "Count"))
There are a lot of bedroom values that are missing. While this could be imputed with a square feet analysis, it is safer to exclude those records for the time being. It is observed that most of the listings are for a single bedroom. The hypothesis here could be that the host is renting out a single room in the apartment they stay in, which is quite common. As the client is not interested in such a scenario, the data is further subsetted for 2 bedroom listings only.
airbnb_data_subset <- airbnb_data_extract[which(airbnb_data_extract$bedrooms == 2),]
setDT(airbnb_data_subset)
The resultant dataset contains of 6497 listings and 47 variables for our analyses.
As noticed above, the cleaned up price columns looks to have certain outliers. The spread of the price is calculated for each property type.
summary_price <- setDT(airbnb_data_subset)[, list(min = min(price_cleaned), Q1 = quantile(price_cleaned)[2], mean = mean(price_cleaned), Q3 = quantile(price_cleaned)[4], max = max(price_cleaned), count = length(price_cleaned)), by = .(property_type)]
summary_price[order(-count)]
The top 6 property types are chosen for the analysis. The remaining types account for very few listings and may be discarded. The “Serviced apartment”, though significantly lower than the other listings, are still included in the analyses because they are priced higher for each night, providing for potential investments.
top_property <- summary_price$property_type[order(-summary_price$count)][1:6]
To demonstrate what a statistical outlier looks like, the data has been standardized for each property type and extreme points that are 20SD away from the mean are removed to provide a clearer view of the data.
airbnb_data_subset[, scale:= scale(price_cleaned, center = T, scale = T), by = .(property_type)]
plot_ly(airbnb_data_subset[which(property_type %in% top_property & scale < 20)], y = ~price_cleaned, color = ~neighbourhood_group_cleansed, type = "box") %>% layout(title = "Price Distribution Across Boroughs", xaxis = list(title = "Boroughs"), yaxis = list(title = "Price"))
It looks like listings in Manhattan are usually priced higher than those in other regions, though there are certain high priced listings in Brooklyn. Manhattan and Brooklyn make up the key neighbourhoods and potential investment zones.
The listings with price 0 are removed from the analyses. A few of the listings with high prices were manually checked on the website, and the prices were verified to be what was actually listed. Hence, the higher price listings are kept as is.
airbnb_data_subset <- airbnb_data_subset[which(price_cleaned >0),]
airbnb_data_subset[,revenue_year := price_cleaned * 0.75 * 365]
summary(airbnb_data_subset$revenue_year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2738 35588 50644 63663 72544 2737226
It is noted that there are certain apartment listings that generate upward of $2 million dollars every year. Usually, a listing is expected to generate 50k to 70k every year. A further breakdown of this information is provided below for each property type.
summary_revenue <- setDT(airbnb_data_subset)[, list(min = min(revenue_year), Q1 = quantile(revenue_year)[2], mean = mean(revenue_year), Q3 = quantile(revenue_year)[4], max = max(revenue_year), count = length(revenue_year)), by = .(property_type)]
summary_revenue[order(-summary_revenue$count)]
It is noticeable that while Apartments are the most common listings, a Condominium or a Serviced Apartment could possibly generate higher revenue if occupancy rate is high.
#Check the number of records and variables
dim(zillow_data_raw)
## [1] 8946 262
The Zillow data consists of 262 variables and 8946 listings present in it.
#Check for duplicates in the data
zillow_data_raw[which(duplicated(zillow_data_raw) == T),]
zillow_data_raw[which(duplicated(zillow_data_raw$RegionID) == T),]
There are no overall duplicates in the data, nor with the RegionIDs. Each record could be uniquely identified using the RegionID column.
zillow_numMissing <- contents(zillow_data_raw, sort = c("NAs"))$contents
table(zillow_numMissing$Storage)
##
## character integer
## 4 258
A majority of the columns are integer values, containing the median price of the listing, with each column representing a particular year-month. A brief view of the same is given below.
zillow_numMissing[order(-zillow_numMissing$NAs),]
The zipcode is given by the RegionName column, that is represented as an integer. There are a few columns that help identify the location. The columns are the city, state, metro and countyname. A summary of this is present below.
sapply(zillow_data_raw[,c("City", "State", "Metro", "CountyName")], function(x) length(unique(x)))
## City State Metro CountyName
## 4684 48 467 722
A study of the unique States and Cities shows that the information present in the Zillow data is not restricted to New York but is for other cities also. The data is subsetted for city “New York”
city <- c("New York")
zillow_data_extract <- zillow_data_raw[which(zillow_data_raw$City == city),]
The data is transformed into a long format with the year-month as key and the median price information as the value.
year_cols <- names(zillow_data_extract)[grep("^X", names(zillow_data_extract), perl = T)]
zillow_transformed <- gather(data = zillow_data_extract, key = "Year", value = "Median_Price", year_cols)
zillow_transformed$Year <- as.character(gsub('[X,]','',zillow_transformed$Year))
zillow_transformed <- zillow_transformed %>% separate(Year, c("Year", "Month"),"[.]")
The trend of the cost is verified in order to choose the appropriate next step for our analyses. This is done by creating a table of the median prices and ploting the trend across years as below.
zillow_transformed$RegionName <- as.factor(zillow_transformed$RegionName)
zillow_transformed %>% group_by(RegionName, Year) %>% filter(Year > 2010) %>% summarise(median_price_year = median(Median_Price)) %>%
plot_ly(x=~as.factor(Year), y=~median_price_year, group = ~RegionName, type = "scatter", color = ~RegionName, mode = "lines+markers") %>% layout(title = "Cost Variation Across Years", xaxis = list(title = "Year"), yaxis = list(title = "Cost"))
While it is noticed that there is a general increase in the median prices of the property across years, the increase has leveled off or is negligible over the past few years, with expections in certain zip codes (the higher priced areas) - 10013, 10028, 10025 to name a few. For ease of use and interpretation, the median price for 2017 is used as the cost of the property for further analyses.
The median price for 2017 is calculated for each zipcode.
setDT(zillow_transformed)
zillow_data_subset <- setDT(zillow_transformed[Year == 2017,])[, Median_Price_2017 := median(Median_Price), by = RegionName]
zillow_data_subset <- unique(zillow_data_subset[,.(RegionName, Median_Price_2017)])
The final dataset contains 25 zipcodes and the median price for 2017, that will be used alongside the AirBnB data.
The AirBnB and the Zillow data are merged based on the zipcodes in order to consolidate the cost information with the listings.
airbnb_zillow_data <- merge(airbnb_data_subset, zillow_data_subset, by.x = "zipcode_2", by.y = "RegionName")
The merged dataset contains of 1567 listings and cost information, with 50 variables of information that include the location, price, reviews, revenue and availability of a listing.
The merged data set is checked to see if there are any duplicates or issues created.
zipcode_match_check <- unique(airbnb_zillow_data[,.(neighbourhood_group_cleansed, zipcode_2)])
airbnb_zillow_data$neighbourhood_group_cleansed[which(airbnb_zillow_data$zipcode_2 == 10013)] <- "Manhattan"
The review column and availability are analysed with this limited number of records for ease of interpretation.
Taking a look at the review columns.
summary(airbnb_zillow_data$number_of_reviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 4.00 19.79 17.00 403.00
summary(airbnb_zillow_data$reviews_per_month)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.010 0.160 0.470 1.188 1.702 12.050 367
There are a lot of missing records in the reviews_per_month columns, and this may have to be recalculated using the number of reviews and duration
There are also some listings that have been reviewed a lot in a short duration of time, but has not been reviews since (which suggests manipulation of reviews and data by the hosts).
In order to use the reviews, it is better to create a derived metric which is the average reviews per month from the time of the first review to the current date. This ensures that we capture consistency and recency much better than the existing reviews per month.
The following functions are used to calculate the difference between two dates in number of months.
#Calculate number of months passed from a reference point
mon_ref <- function(date) { days <- as.POSIXlt(as.Date(date, origin="1900-01-01"));
days$year*12 + days$mon }
#Compute difference between two months
mon_diff <- function(date1, date2) { mon_ref(date2) - mon_ref(date1) }
airbnb_zillow_data[, month_diff_todate := mon_diff(as.Date(first_review), Sys.Date())]
airbnb_zillow_data[, review_pm_todate := number_of_reviews/month_diff_todate]
airbnb_zillow_data$review_pm_todate[!is.finite(airbnb_zillow_data$review_pm_todate)] <- 0
This metric can be used as a placeholder for the popularity of the listing.
sapply(airbnb_zillow_data[,.(availability_30, availability_60, availability_90, availability_365)], summary)
## availability_30 availability_60 availability_90 availability_365
## Min. 0.000000 0.00000 0.00000 0.0000
## 1st Qu. 0.000000 0.00000 0.00000 0.0000
## Median 0.000000 7.00000 12.00000 59.0000
## Mean 5.887045 15.85386 25.77026 120.7064
## 3rd Qu. 8.000000 27.00000 49.00000 247.0000
## Max. 30.000000 60.00000 90.00000 365.0000
The availability columns looks to have a lot of 0’s present, which means that the lisitng is not available at all for booking for that time period. This seems suspicious as it is the case for a lot of listings. This might be because of some data entry issue. In order to perform any analyses better, it might be a safer approach to subset for listings with any non-zero value till the time that it can be safely judged as not a data error.
The aim of the deep dive is to come up with an idea of what zipcodes are profitable, are popular and can be deemed as good investment areas. This can be achieved in multiple stages, each aimed at revealing a particular aspect of the area by the following steps.
What is the neighbourhood that has the most listings in NYC?
The data is analysed in order to understand how the listings are spread across the boroughs and the zipcodes presently. This would serve as an indication of which is the popular area currently in NYC.
plot_ly(x = as.data.frame(table(airbnb_zillow_data$neighbourhood_group_cleansed))$Var1, y = as.data.frame(table(airbnb_zillow_data$neighbourhood_group_cleansed))$Freq, name = "Frequency", color = as.data.frame(table(airbnb_zillow_data$neighbourhood_group_cleansed))$Var1, type = "bar", showlegend = FALSE) %>% layout(title = "Frequency Distributon of Listings Across Boroughs", xaxis = list(title = "Boroughs"), yaxis = list(title = "Count"))
airbnb_zillow_data[, .(count = .N), by = .(neighbourhood_group_cleansed, zipcode_2)] %>%
plot_ly(x = ~zipcode_2, y = ~count, type = "bar", color = ~neighbourhood_group_cleansed) %>% layout(title = "Frequency Distributon of Listings Across Zipcode", xaxis = list(title = "Zipcode"), yaxis = list(title = "Count"))
The summary of the above analyses is present below.
listing_count <- airbnb_zillow_data[, .(count = .N), by = .(neighbourhood_group_cleansed, zipcode_2)]
listing_count[, list_rank := frank(-count), by = list(neighbourhood_group_cleansed)]
listing_count[, overall_list_rank := frank(-count)]
listing_count[, overall_list_rank_unique := frank(-count, ties.method = "first")]
setkey(listing_count, overall_list_rank)
listing_count
How does the price reflect across region and zipcodes?
The price that is demanded by each host is analysed as this could be a clear indication of which zone has more demand.
plot_ly(airbnb_zillow_data[which(property_type %in% top_property)], y = ~price_cleaned, color = ~property_type, type = "box") %>% layout(title = "Price Distributon of Listings Across Property Type", xaxis = list(title = "Property Type"), yaxis = list(title = "Price $"))
It is noticed that while generally, there are other apartment types that may be higher priced on average, Apartments are usually priced higher than other types. The data also is heavily skewed in the number of apartment listings, which cannot be ignored. This is an indication that apartments could be a prefered property type by tourists and renters.
plot_ly(airbnb_zillow_data[which(property_type %in% top_property)], y = ~price_cleaned, color = ~neighbourhood_group_cleansed, type = "box") %>% layout(title = "Price Distributon of Listings Across Boroughs", xaxis = list(title = "Boroughs"), yaxis = list(title = "Price $"))
Manhattan and Brooklyn are clearly the two boroughs that bring in most of the revenue. This could be attributed to the inherent popularity and the sheer volume of tourist spots in the surrounding regions. This is further broken down into the zipcodes as below.
airbnb_zillow_data[, .(median_price = median(price_cleaned)), by = .(neighbourhood_group_cleansed, zipcode_2)] %>%
plot_ly(x = ~zipcode_2, y = ~median_price, type = "bar", color = ~neighbourhood_group_cleansed) %>% layout(title = "Price Distributon of Listings Across Zipcodes", xaxis = list(title = "Zipcode"), yaxis = list(title = "Median Price $"))
When the median price/night is analysed across zipcodes, it reveals that all listings across Manhattan and priced generally higher than the listings in other areas. The spread of the prices for each zipcode is displayed below.
plot_ly(airbnb_zillow_data[which(property_type %in% top_property)],x=~zipcode_2, y = ~price_cleaned, color = ~neighbourhood_group_cleansed, type = "box") %>% layout(title = "Price Distributon of Listings Across Zipcodes", xaxis = list(title = "Zipcode"), yaxis = list(title = "Price $"))
While the data points to certain listings that mathetically look to be outliers, there is no strong business case against them. The trend noticed above is reflected here again, with listings in Manhattan priced more than Brooklyn and listings in Staten Island and Queens are the least.
The summary of the above analyses is present below.
listing_median_price <- airbnb_zillow_data[, .(median_price = median(price_cleaned)), by = .(neighbourhood_group_cleansed, zipcode_2)]
listing_median_price[, median_rank := frank(-median_price), by = list(neighbourhood_group_cleansed)]
listing_median_price[, overall_median_rank := frank(-median_price)]
listing_median_price[, overall_median_rank_unique := frank(-median_price, ties.method = "first")]
setkey(listing_median_price, overall_median_rank)
listing_median_price
How does the revenue calculated vary across region and zipcode?
While the price per night is an indication of how much a listing would cost, the revenue it brings in is useful to identify the top revenue bringing zip-codes.
Since we do not have varying occupancy rates for this analysis, the spread across individual regions would remain the same as before. But because of the sheer number of listings present in Manhattan and Brooklyn, along with the general higher cost, we notice that the revenue brought in by the two boroughs are significantly higher than the other boroughs.
airbnb_zillow_data[, .(sum_revenue = sum(revenue_year)), by = .(neighbourhood_group_cleansed, zipcode_2)] %>%
plot_ly(x = ~zipcode_2, y = ~sum_revenue, type = "bar", color = ~neighbourhood_group_cleansed) %>% layout(title = "Revenue Distributon of Listings Across Zipcodes", xaxis = list(title = "Zipcodes"), yaxis = list(title = "Sum Revenue $"))
How much would the investment be across different regions?
While we have analysed the revenue and the price of a listing in a region, it would also be necessary to look at the investment necessary in each of the region.
plot_ly(airbnb_zillow_data[which(property_type %in% top_property)], y = ~Median_Price_2017, color = ~neighbourhood_group_cleansed, type = "box") %>% layout(title = "Investment Distributon of Listings Across Boroughs", xaxis = list(title = "Borough"), yaxis = list(title = "Investment $"))
It can be seen that it takes a significantly higher investment to own and rent out a property in Manhattan as compared to Brooklyn.
airbnb_zillow_data[, .(median_investment = median(Median_Price_2017)), by = .(neighbourhood_group_cleansed, zipcode_2)] %>%
plot_ly(x = ~zipcode_2, y = ~median_investment, type = "bar", color = ~neighbourhood_group_cleansed) %>% layout(title = "Investment Distributon of Listings Across Zipcodes", xaxis = list(title = "Zipcode"), yaxis = list(title = "Median Investment $"))
The summary of the above analyses is present below.
listing_investment_price <- airbnb_zillow_data[, .(median_investment = median(Median_Price_2017)), by = .(neighbourhood_group_cleansed, zipcode_2)]
listing_investment_price[, investment_rank := frank(median_investment), by = list(neighbourhood_group_cleansed)]
listing_investment_price[, overall_investment_rank := frank(median_investment)]
listing_investment_price[, overall_investment_rank_unique := frank(median_investment, ties.method = "first")]
setkey(listing_investment_price, overall_investment_rank)
listing_investment_price
How are the reviews across different region and zipcode?
While it is obvious that Manhattan and Brooklyn are the most profitable (in terms of revenue) and also the most costly in terms of investment, it might be useful to check the reviews and availability as an indicator of people’s preference in the area.
plot_ly(airbnb_zillow_data[which(property_type %in% top_property)], y = ~review_pm_todate, color = ~neighbourhood_group_cleansed, type = "box") %>% layout(title = "Reviews Distributon of Listings Across Boroughs", xaxis = list(title = "Boroughs"), yaxis = list(title = "Review Per Month"))
While there are only a handful of listings in Staten Island and Queens, they look to be more popular amongst the tourists as compared to the listings in Manhattan and Brooklyn (which does have some very poorly reviewed listings). If a listing is popular, it can be checked for its availability over a period of 3 months as below.
plot_ly(data = airbnb_zillow_data[which(review_pm_todate > 0 & availability_90 > 0)], x = ~availability_90, y = ~review_pm_todate, color = ~neighbourhood_group_cleansed) %>% layout(title = " Variation of Review Per Month and Availability (over 90 days)", xaxis = list(title = "Availability over 90 days"), yaxis = list(title = "Review Per Month"))
While there is no clear pattern in the data, it is evident that reviews play an important role for pre-booking. It might be thought of that if a listing is reviewed very well, it has a better chance of being booked in advance, and on that notion, lisitings in Queens and Staten Island might serve to be good investments.
A summary of the above analysis is presented below.
listing_reviews_temp_data <- airbnb_zillow_data[which(review_pm_todate >0 & availability_90 > 0),]
listing_reviews <- listing_reviews_temp_data[, .(median_reviews_pm = median(review_pm_todate)), by = .(neighbourhood_group_cleansed, zipcode_2)]
listing_reviews[, review_rank := frank(-median_reviews_pm), by = list(neighbourhood_group_cleansed)]
listing_reviews[, overall_reviews_rank := frank(-median_reviews_pm)]
listing_reviews[, overall_reviews_rank_unique := frank(-median_reviews_pm, ties.method = "first")]
setkey(listing_reviews, overall_reviews_rank)
listing_reviews
What would be the breakeven time for the investment in a region?
A final question to ask would be on how many years would it take to reach the break even point, assuming the current revenue generated and occupancy rate (with no expected change over years). This value would add significant weight on each area, because it might not be viable to investment heavily only for the returns realized over a significant number of years.
airbnb_zillow_data[, breakeven := Median_Price_2017/revenue_year]
plot_ly(airbnb_zillow_data[which(property_type %in% top_property)], y = ~breakeven, color = ~neighbourhood_group_cleansed, type = "box") %>% layout(title = "Breakeven Distributon of Listings Across Boroughs", xaxis = list(title = "Boroughs"), yaxis = list(title = "Breakeven Point in Years"))
As expected, some listings in Manhattan will only be realized after a significant amount of years. A deep dive at a region level is present below.
airbnb_zillow_data[, .(median_breakeven = median(breakeven)), by = .(neighbourhood_group_cleansed, zipcode_2)] %>%
plot_ly(x = ~zipcode_2, y = ~median_breakeven, type = "bar", color = ~neighbourhood_group_cleansed) %>% layout(title = "Breakeven Distributon of Listings Across Zipcodes", xaxis = list(title = "Zipcode"), yaxis = list(title = "Median Breakeven Point in Years"))
It is faster to realize the money invested in Queens and Staten Island, as compared to the other boroughs. This is worth consideration.
A summary of the analysis is present below.
listing_breakeven <- airbnb_zillow_data[, .(median_breakeven = median(breakeven)), by = .(neighbourhood_group_cleansed, zipcode_2)]
listing_breakeven[, breakeven_rank := frank(median_breakeven), by = list(neighbourhood_group_cleansed)]
listing_breakeven[, overall_breakeven_rank := frank(median_breakeven)]
listing_breakeven[, overall_breakeven_rank_unique := frank(median_breakeven, ties.method = "first")]
setkey(listing_breakeven, overall_breakeven_rank)
listing_breakeven
The above deep dives had a single objective - to provide individual recommendations based on a regions current scenario, popularity, investment needed, profitability and customer attraction. The inherent understanding is that while it would be tough to come up with an objective list of top zipcodes, it would be better and somewhat fruitful to break it down by key drivers of investment. A particular zipcode that brings in a lot of revenue might not be the most profitable investment. Another zipcode that does not have many listings might be popular amongst customer and may provide with quicker returns. A simple univariate decision cannot be made. A better approach might be to come up with a list of recommendations as below.
Renaming certain summary columns for better consolidation and interpretation
names(listing_count)[which(names(listing_count) %in% c("neighbourhood_group_cleansed", "zipcode_2"))] <- c("Count_Neighbourhood", "Count_Zip")
names(listing_median_price)[which(names(listing_median_price) %in% c("neighbourhood_group_cleansed", "zipcode_2"))] <- c("Price_Neighbourhood", "Price_Zip")
names(listing_investment_price)[which(names(listing_investment_price) %in% c("neighbourhood_group_cleansed", "zipcode_2"))] <- c("Invest_Neighbourhood", "Invest_Zip")
names(listing_reviews)[which(names(listing_reviews) %in% c("neighbourhood_group_cleansed", "zipcode_2"))] <- c("Review_Neighbourhood", "Review_Zip")
names(listing_breakeven)[which(names(listing_breakeven) %in% c("neighbourhood_group_cleansed", "zipcode_2"))] <- c("Breakeven_Neighbourhood", "Breakeven_Zip")
listing_count[,.(overall_list_rank, Count_Neighbourhood, Count_Zip)][1:10,]
listing_median_price[,.(overall_median_rank, Price_Neighbourhood, Price_Zip)][1:10,]
listing_investment_price[,.(overall_investment_rank, Invest_Neighbourhood, Invest_Zip)][1:10,]
listing_reviews[,.(overall_reviews_rank, Review_Neighbourhood, Review_Zip)][1:10,]
listing_breakeven[,.(overall_breakeven_rank, Breakeven_Neighbourhood, Breakeven_Zip)][1:10,]
And when consolidated together, we get…
listing_overall <- merge(listing_count[,c("overall_list_rank_unique", "Count_Neighbourhood", "Count_Zip")], listing_median_price[, c("overall_median_rank_unique", "Price_Neighbourhood", "Price_Zip")], by.x = "overall_list_rank_unique", by.y = "overall_median_rank_unique")
listing_overall <- merge(listing_overall, listing_investment_price[, c("overall_investment_rank_unique", "Invest_Neighbourhood", "Invest_Zip")], by.x = "overall_list_rank_unique", by.y = "overall_investment_rank_unique")
listing_overall <- merge(listing_overall, listing_reviews[, c("overall_reviews_rank_unique", "Review_Neighbourhood", "Review_Zip")], by.x = "overall_list_rank_unique", by.y = "overall_reviews_rank_unique")
listing_overall <- merge(listing_overall, listing_breakeven[, c("overall_breakeven_rank_unique", "Breakeven_Neighbourhood", "Breakeven_Zip")], by.x = "overall_list_rank_unique", by.y = "overall_breakeven_rank_unique")
listing_overall[,.(overall_list_rank_unique, Count_Zip, Price_Zip, Invest_Zip, Review_Zip, Breakeven_Zip)]
It is also possible to come up with a single list of priority based on the above analysis.
listing_singular <- merge(listing_count[,.(Count_Zip, Count_Neighbourhood, overall_list_rank)], listing_median_price[,.(Price_Zip, overall_median_rank)], by.x = "Count_Zip", by.y = "Price_Zip")
listing_singular <- merge(listing_singular, listing_investment_price[,.(Invest_Zip, overall_investment_rank)], by.x = "Count_Zip", by.y = "Invest_Zip")
listing_singular <- merge(listing_singular, listing_reviews[,.(Review_Zip, overall_reviews_rank)], by.x = "Count_Zip", by.y = "Review_Zip")
listing_singular <- merge(listing_singular, listing_breakeven[,.(Breakeven_Zip, overall_breakeven_rank)], by.x = "Count_Zip", by.y = "Breakeven_Zip")
listing_singular[, overall_Score := overall_list_rank * overall_median_rank * overall_investment_rank * overall_reviews_rank * overall_breakeven_rank]
setkey(listing_singular, overall_Score)
listing_singular[,.(Count_Zip, Count_Neighbourhood, overall_Score)]
The top investments based on all the above factors look to be in the areas
The above analysis can be bettered through various improvements in data consolidation, analysis and review. A few of them are suggested below.