Airbnb is a global online hospitality service that connects travellers with lodging from local homeowners. The website and mobile application provide a marketplace for individuals to book or offer rooms. With services across numerous cities across the globe, Airbnb contains massive amounts of data on thousands of listings per region. In particular, San Francisco is where Airbnb was founded, and remains the location of its headquarters. It is a burgeoning city undergoing an economic boom due to the technology industry, and it would be interesting to observe rental housing prices under these circumstances.
The data used for this analysis is obtained from Inside Airbnb, a website that provides data scraped from public listings on the Airbnb website. I will be analyzing San Francisco data that was compiled on October 03, 2018. In particular, I will be using the “Detailed Listings data for San Francisco” with a file name of “listings.csv.gz”, and “Detailed Review Data for listings in San Francisco” with a file name of “reviews.csv.gz”. These datasets contain information about the same listings, so I will join them together by the variable listing_id
.
The primary goal of this analysis is to identify the most important factors that affect housing prices, and attempt to predict prices based on these variables. Additionally, I would like to perform sentiment analysis on the reviews, and provide summary statistics of reviews for each listing.
After looking at Airbnb listing prices and examining the variables, I concluded that the most important factors that affect price are the type of room and the number of bedrooms. These results are intuitive. However, I would like to focus on more surprising findings.
While location matters, the data does not provide a good way to include location in a model.
# Load packages
library(tidyverse)
library(ggmap)
library(knitr)
# Import datasets
prices <- read_csv(file = ("data/processed/prices.csv"), col_types = cols("n", col_factor(NULL), col_factor(NULL), "n", "n", "n", "c", "n", "n"))
sentiments <- read_csv("data/processed/sentiments.csv")
listings <- read_csv("data/processed/listings.csv")
sf <- get_stamenmap(bbox = c(left = -122.5164, bottom = 37.7066, right = -122.3554, top = 37.8103), maptype = c("toner-lines"), zoom = 13)
ggmap(sf) +
geom_point(data = prices, aes(x = longitude, y = latitude, color = price), alpha = 0.7) +
scale_colour_gradient(low = '#a4faff', high = '#00275b') +
labs(title = "Prices by Location")
When overlaying location on a map, I can see that the center and northeast tend to have darker colors, indicating higher prices. Unfortunately, linear models do not pick up on this trend. A possible explanation is that longitude and latitude don’t vary much across a city.
Another surprising finding is that number of bathrooms
does not affect price
.
prices %>%
filter(bedrooms <= 5, bathrooms <= 5) %>%
ggplot(aes(x = as.factor(bathrooms), y = price)) +
geom_boxplot() +
facet_wrap(~bedrooms) +
labs(title = "Price vs Number of Bathrooms per Number of Bedrooms", x = "Number of Bathrooms", y = "Price")
A possible explanation may be that the number of bathrooms
is dependent on the number of bedrooms
, and only affects price
through bedrooms
.
The last interesting aspect is the sentiment analysis I performed on reviews. I found sentiment scores by subtracting the number of negative words from the number of positive words in each review, and I averaged the scores.
# sentiments per review
listing_sent <- sentiments %>%
# group by reviewer
group_by(reviewer_id) %>%
# count positive and negatve sentiments
count(listing_id, sentiment) %>%
# make new positive and negative count columns
spread(key = sentiment, value = n, fill = 0) %>%
# calculate overall sentiment per review
mutate(sent = positive - negative)
# average sentiments per listing
listing_sent <- listing_sent %>%
# group by listing
group_by(listing_id) %>%
# average sentiment per listing
summarize(avg_sent = mean(sent)) %>%
# join with original dataset
left_join(listings, by = c("listing_id" = "id"))
listing_sent %>%
filter(price <= 250) %>%
ggplot(aes(x = price, y = avg_sent)) +
geom_point() +
geom_smooth(se = FALSE, method = "lm") +
labs(title = "Average Sentiment Score vs Price",
x = "Price", y = "Average Sentiment Score")
Looking at most of the Airbnb’s, there is a strong relationship between price
and average sentiment score.