Airbnb is a global online hospitality service that connects travellers with lodging from local homeowners. The website and mobile application provide a marketplace for individuals to book or offer rooms. With services in numerous cities around the globe, Airbnb holds massive amounts of data on thousands of listings per region. San Francisco in particular is where Airbnb was founded, and it remains the location of its headquarters. It is a burgeoning city undergoing an economic boom due to the technology industry, and it would be interesting to observe rental housing prices under these circumstances.
The data used for this analysis is obtained from Inside Airbnb, a website that provides data scraped from public listings on the Airbnb website. I will be analyzing San Francisco data that was compiled on October 03, 2018. In particular, I will be using the “Detailed Listings data for San Francisco” with a file name of “listings.csv.gz”, and “Detailed Review Data for listings in San Francisco” with a file name of “reviews.csv.gz”. These datasets contain information about the same listings, so I will join them together by the variable listing_id.
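The join described above can be sketched on toy data. This is a minimal illustration in base R, assuming the reviews table carries listing_id and the listings table carries id (as the later join in this analysis suggests). The data frames listings_toy and reviews_toy and all their values are invented for illustration.

```r
# Toy stand-ins for listings.csv.gz and reviews.csv.gz (values are made up)
listings_toy <- data.frame(id = c(101, 102, 103),
                           price = c(170, 235, 65))
reviews_toy <- data.frame(listing_id = c(101, 101, 103),
                          comments = c("Great stay", "Very clean", "Noisy street"),
                          stringsAsFactors = FALSE)

# Keep every review and attach the columns of the matching listing
joined <- merge(reviews_toy, listings_toy,
                by.x = "listing_id", by.y = "id", all.x = TRUE)
print(joined)
```

Listings without reviews (102 here) drop out of this result. Swapping the two tables and using all.x = TRUE again would keep them instead.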
The primary goal of this analysis is to identify the most important factors that affect housing prices, and attempt to predict prices based on these variables. Additionally, I would like to perform sentiment analysis on the reviews, and provide summary statistics of reviews for each listing.
# Load packages
library(tidyverse)
library(modelr)
library(caret)
library(ggmap)
library(broom)
library(knitr)
# Import datasets
prices <- read_csv(file = ("data/processed/prices.csv"), col_types = cols("n", col_factor(NULL), col_factor(NULL), "n", "n", "n", "c", "n", "n"))
After importing prices, a processed dataset that contains listing prices and several important factors that may go into daily rates, I will first take a look at the composition of the data.
prices %>% glimpse()
## Observations: 6,695
## Variables: 9
## $ price <dbl> 170, 235, 65, 65, 785, 255, 139, 135, 2...
## $ neighbourhood_cleansed <fct> Western Addition, Bernal Heights, Haigh...
## $ room_type <fct> Entire home/apt, Entire home/apt, Priva...
## $ accommodates <dbl> 3, 5, 2, 2, 5, 6, 3, 2, 6, 2, 4, 3, 5, ...
## $ bathrooms <dbl> 1.0, 1.0, 4.0, 4.0, 1.5, 1.0, 1.0, 1.0,...
## $ bedrooms <dbl> 1, 2, 1, 1, 2, 2, 1, 1, 2, 0, 3, 3, 3, ...
## $ amenities <chr> "{TV,\"Cable TV\",Internet,Wifi,Kitchen...
## $ latitude <dbl> 37.76931, 37.74511, 37.76669, 37.76487,...
## $ longitude <dbl> -122.4339, -122.4210, -122.4525, -122.4...
prices contains 9 columns:
- price: daily base rate for booking the room
- neighbourhood_cleansed: neighborhood of listing (36 total)
- room_type: type of room (3 total)
- accommodates: number of individuals that can be accommodated
- bathrooms: number of bathrooms
- bedrooms: number of bedrooms
- amenities: amenities available
- latitude, longitude: the latitude or longitude of the Airbnb
I will examine each variable and see whether there are any interesting relationships.
Since individuals generally choose a hotel based on where they are travelling, I believe that neighbourhood_cleansed, longitude, and latitude are very influential on price. The more popular tourist areas are most likely more expensive.
I sorted the 36 neighborhoods in San Francisco based on their average housing price into 6 groups. Below are the groupings for reference.
# Find avg price per neighborhood
prices_neigh <- prices %>%
  group_by(neighbourhood_cleansed) %>%
  mutate(avg_price = round(mean(price), 2))
# Label neighborhoods by price range
prices_neigh[["neigh_cat"]] <- cut(prices_neigh$avg_price, 6, labels = c("Cheapest", "Cheap", "Moderately Cheap", "Moderately Expensive", "Expensive", "Most Expensive"))
# Neighborhood groups
prices_neigh %>%
  select(neighbourhood_cleansed, neigh_cat, avg_price) %>%
  group_by(neighbourhood_cleansed) %>%
  unique() %>%
  arrange(neigh_cat, avg_price)
The average price per neighborhood ranges from as low as $105 per night for the cheapest neighborhoods to $287.83 per night for the most expensive.
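It may help to see how cut() forms these groups. When given a single number of bins, it divides the range of the values into equal-width intervals, so the groups need not contain the same number of neighborhoods. The sketch below uses hypothetical averages (not values from the data) and three bins for brevity, and contrasts this with quantile-based breaks as one possible alternative.

```r
# Hypothetical neighborhood averages (not taken from the data)
avg_price <- c(105, 120, 150, 200, 250, 287.83)

# cut() with a single number splits the range into equal-width intervals
groups <- cut(avg_price, 3, labels = c("Low", "Mid", "High"))
print(table(groups))

# Quantile-based breaks would instead give roughly equal group sizes
breaks_q <- quantile(avg_price, probs = seq(0, 1, length.out = 4))
groups_q <- cut(avg_price, breaks = breaks_q, include.lowest = TRUE,
                labels = c("Low", "Mid", "High"))
print(table(groups_q))
```

With equal-width bins the three cheapest toy values share one group, while the quantile version places two values in each group.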
Now, let’s look at price for each neighbourhood_cleansed and accommodates pair to see roughly what an Airbnb would cost for different numbers of people.
prices_neigh %>%
  filter(accommodates <= 6) %>%
  ggplot(aes(x = as.factor(accommodates), y = fct_reorder(neighbourhood_cleansed, avg_price))) +
  geom_tile(aes(fill = price)) +
  scale_fill_gradientn(colors = c("#c1e8ff", "#002b42")) +
  labs(title = "Price vs Neighborhood and Accommodates",
       x = "Accommodates", y = "Neighborhood")
The plot above shows neighbourhood_cleansed, in ascending order of average price, against accommodates. The color represents price. Generally, the color gets darker as accommodates increases and as the neighborhoods become more expensive. This indicates that Airbnb listings tend to be more expensive in pricier tourist areas and for larger groups. However, the trend is not very clear.
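One possible reason the trend is hard to read is that many listings share the same neighborhood and accommodates pair, so several tiles may be drawn on top of one another and each cell shows a single listing rather than a typical price. A minimal sketch of one way around this, on invented toy data, is to average price per cell before plotting. The names toy and tile_data are hypothetical.

```r
# Toy listings: several rows fall into the same (neighbourhood, accommodates) cell
toy <- data.frame(
  neighbourhood = c("A", "A", "A", "B", "B"),
  accommodates = c(2, 2, 4, 2, 2),
  price = c(100, 140, 200, 80, 90)
)

# One row per cell, holding the mean price of the listings in that cell
tile_data <- aggregate(price ~ neighbourhood + accommodates, data = toy, FUN = mean)
print(tile_data)
```

A summarised table like tile_data could then be passed to geom_tile() so that each cell has exactly one fill value.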
To visualize the individual listings, I’ll overlay them on a map of San Francisco and color the points by price.
sf <- get_stamenmap(bbox = c(left = -122.5164, bottom = 37.7066, right = -122.3554, top = 37.8103), maptype = c("toner-lines"), zoom = 13)
ggmap(sf) +
  geom_point(data = prices, aes(x = longitude, y = latitude, color = price), alpha = 0.7) +
  scale_colour_gradient(low = '#a4faff', high = '#00275b') +
  labs(title = "Prices by Location")
Based on this map, it seems as though most of the Airbnb listings are clustered in the center of the city and towards the northeast. This makes sense, since the center region is the heart of downtown San Francisco, and the northeast is next to the bay, which has many tourist attractions.
There are three types of rooms:
- Entire home/apt indicates that the listing is for a complete house or apartment for the guest
- Private room indicates that the listing is for a room within a house or apartment that may be occupied by others
- Shared room indicates that the room is shared with other individuals.
Intuitively, I would expect that the more private the listing, the higher the price. Additionally, I would expect a higher number of bedrooms to be more expensive.
prices %>%
  filter(bedrooms <= 5) %>%
  group_by(room_type) %>%
  mutate(avg_price = mean(price)) %>%
  ggplot(aes(x = room_type, color = as.factor(bedrooms))) +
  geom_boxplot(aes(y = price)) +
  geom_point(aes(y = avg_price, size = 1), color = "red") +
  labs(title = "Price vs Room Type by Number of Bedrooms", x = "Room Type", y = "Price", color = "Number of Bedrooms", size = "Average Price")
The plot above confirms my hypotheses. The red points indicate the average price for each room_type, and they show a clear decreasing pattern as the listing becomes less private. There are also boxplots for number of bedrooms, represented by different colors. For each room_type, the price increases as bedrooms increases.
I also have data on the number of bathrooms. Whole bathrooms that include a bath or shower are counted as one, while half bathrooms, which only have a toilet and a sink, are counted as one half. It seems as though bathrooms would be dependent on bedrooms, so its correlation to price would come from bedrooms rather than from bathrooms itself.
prices %>%
  filter(bedrooms <= 5, bathrooms <= 5) %>%
  ggplot(aes(x = as.factor(bathrooms), y = price)) +
  geom_boxplot() +
  facet_wrap(~bedrooms) +
  labs(title = "Price vs Number of Bathrooms per Number of Bedrooms", x = "Number of Bathrooms", y = "Price")
Looking at the plot above, which displays price vs bathrooms faceted by bedrooms, there does not seem to be a clear relationship between bathrooms and price within each number of bedrooms. Only at 4 bedrooms is there a slightly positive correlation between bathrooms and price. For the other values of bedrooms, they are not positively correlated, so I tentatively conclude that bathrooms does not affect price on its own.
I will now look at the number of individuals accommodated by one listing and view the relationship with price. I expect a positive relationship, since generally, more individuals would require a larger space and a more expensive house.
prices %>%
  ggplot(aes(x = as.factor(accommodates), y = price)) +
  geom_boxplot(aes(color = room_type)) +
  geom_smooth(aes(x = accommodates), se = FALSE, color = "purple") +
  labs(title = "Price vs Number Accommodated", x = "Number Accommodated", y = "Price", color = "Room Type")
The boxplots above represent price for the number of people accommodated. For each number of people, there is a different boxplot based on room_type. There is a positive relationship until accommodates reaches 9 people. Beyond that point the trend breaks down, which may be because there are few listings of that size, so individual observations easily affect the trend. We still see the same relationship between room_type and price.
To work with amenities, which is stored as a single string per listing, I will first convert it to a list of character vectors and replace the column in prices.
# Convert amenities from char to list
amen <- prices$amenities
amen_list <- vector("list", length(amen))
for (i in seq_along(amen)) {
  # remove punctuation and extra characters
  amen_list[[i]] <- str_remove_all(amen[[i]], pattern = '[^([A-Z][a-z][0-9]|,|\'|\\s)]') %>%
    # split by comma
    str_split(pattern = ",")
}
prices[["amenities"]] <- amen_list
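For comparison, the same idea can be sketched without a loop. This is a simplified base R illustration, not the method used above: it assumes the only characters to strip are the braces and double quotes seen in the glimpse() output, and the example strings are invented.

```r
# Example strings in the same shape as the amenities column (made up)
amen_raw <- c('{TV,"Cable TV",Internet,Wifi}', '{Wifi,"Hair dryer"}')

# Drop braces and double quotes, then split each string on commas
amen_clean <- gsub('[{}"]', "", amen_raw)
amen_split <- strsplit(amen_clean, ",", fixed = TRUE)

# Tabulate how often each amenity appears across listings
amen_counts <- sort(table(unlist(amen_split)), decreasing = TRUE)
print(amen_counts)
```

Because gsub() and strsplit() are vectorised, the whole column is processed in one call and the result is a list with one character vector per listing.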
Now each Airbnb listing has a list of amenities. Next, I want to look at the 15 most frequently listed amenities.
# Count frequency
amen_freq <- prices$amenities %>%
  unlist() %>%
  table()
# Remove unwanted rows
amen_freq <- amen_freq[-c(1, 168, 169)]
# Top 15
amen_freq %>%
  sort(decreasing = TRUE) %>%
  head(15) %>%
  kable(col.names = c("Amenities", "Freq"), align='c')
Amenities | Freq |
---|---|
Wifi | 6607 |
Essentials | 6488 |
Heating | 6312 |
Smoke detector | 6241 |
Hangers | 5995 |
Hair dryer | 5688 |
Kitchen | 5661 |
Shampoo | 5630 |
Iron | 5484 |
Laptop friendly workspace | 5411 |
Carbon monoxide detector | 5384 |
TV | 5288 |
Washer | 4736 |
Dryer | 4728 |
Fire extinguisher | 4614 |
Many of these amenities seem reasonable, as they are common in hotels. I would expect there to be “Wifi”, “Shampoo”, and a “TV”, since these are pretty standard for the lodging industry. Some interesting ones are “Smoke detector”, “Carbon monoxide detector”, and “Fire extinguisher”. I would not have expected those to be listed as amenities. However, it makes sense, since they are legally required to be installed in every house or apartment in San Francisco. They may not top the list because some hosts do not consider them to be amenities and so have not listed them on Airbnb. I would expect them to be the most common amenities if every host that provides them listed them.
It would be interesting to see how amenities differ between expensive and cheap Airbnb listings.
# Split into two groups at the median price
amen_prices <- prices %>%
  mutate(above_median = price > median(price))
# Amenities per group
expensive_amenities <- amen_prices %>%
  filter(above_median) %>%
  select(amenities) %>%
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)
cheap_amenities <- amen_prices %>%
  filter(!above_median) %>%
  select(amenities) %>%
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)
kable(list(head(expensive_amenities, 10), head(cheap_amenities, 10)), col.names = c("Amenities", "Freq"), align='c')
The above table shows the top 10 most frequently listed amenities for listings at above median prices (left) and below median prices (right). There does not seem to be a large difference in the amenities. This may be because price differs more on quality than simply which amenities are offered. Another explanation is that common amenities are provided by essentially all Airbnb hosts. Lastly, amenities may not be that important when considering price. The more important factors may be location or size.
I have examined all the variables in the prices dataset, so I will move on to attempting to predict price. I will test a few different linear models and determine which one best fits price, then apply that model to a testing dataset and see how accurate the model is.
# Model dataset
dataset <- prices %>%
select(-amenities)
# Train-Test Split
set.seed(100)
pindex <- createDataPartition(dataset$price, p = 0.8, list = FALSE)
ptrain <- dataset[pindex,]
ptest <- dataset[-pindex,]
First, I split the dataset into train and test sets. 80% of the data is in the ptrain set, and 20% is in ptest.
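The mechanics of an 80/20 split can be sketched in base R as follows. This is a plain random split on invented toy data, not a reproduction of createDataPartition(), which (as I understand it) additionally balances the split across the distribution of a numeric outcome. The names toy, train_idx, train, and test are hypothetical.

```r
set.seed(100)

# Toy dataset with ten rows (values are made up)
toy <- data.frame(price = seq(50, 500, by = 50),
                  bedrooms = rep(1:2, 5))

# Draw 80% of the row indices at random for training
n <- nrow(toy)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

# Rows not drawn for training form the test set
train <- toy[train_idx, ]
test <- toy[-train_idx, ]
```

Setting the seed first makes the split reproducible, which matters when comparing models across runs.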
# Linear models
# all data
price_lm_1 <- ptrain %>%
  lm(price ~ ., data = .)
price_lm_1$call
## lm(formula = price ~ ., data = .)
glance(price_lm_1) %>% kable()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.5048863 | 0.5009739 | 98.82789 | 129.0455 | 0 | 43 | -32192.42 | 64472.83 | 64762.63 | 51911352 | 5315 |
# all except lat and lon
price_lm_2 <- ptrain %>%
  lm(price ~ neighbourhood_cleansed + room_type + accommodates
     + bathrooms + bedrooms, data = .)
price_lm_2$call
## lm(formula = price ~ neighbourhood_cleansed + room_type + accommodates +
## bathrooms + bedrooms, data = .)
glance(price_lm_2) %>% kable()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.5031746 | 0.499437 | 98.97996 | 134.6237 | 0 | 41 | -32201.66 | 64487.32 | 64763.95 | 52090823 | 5317 |
# all except neighborhood
price_lm_3 <- ptrain %>%
  lm(price ~ room_type + accommodates + bathrooms +
       bedrooms + latitude + longitude, data = .)
price_lm_3$call
## lm(formula = price ~ room_type + accommodates + bathrooms + bedrooms +
## latitude + longitude, data = .)
glance(price_lm_3) %>% kable()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.4871795 | 0.4865085 | 100.25 | 726.0715 | 0 | 8 | -32286.55 | 64591.1 | 64650.38 | 53767868 | 5350 |
Above are the model formulas and their summary values. The first model has the highest R² (0.505) and adjusted R² (0.501), although the second model is very close.
# Add predictions and residuals
ptest <- ptest %>%
  add_predictions(model = price_lm_1, var = "1_pred") %>%
  add_predictions(model = price_lm_2, var = "2_pred") %>%
  add_predictions(model = price_lm_3, var = "3_pred") %>%
  add_residuals(model = price_lm_1, var = "1_resid") %>%
  add_residuals(model = price_lm_2, var = "2_resid") %>%
  add_residuals(model = price_lm_3, var = "3_resid")
kable(list(sum(ptest$`1_resid`^2),
           sum(ptest$`2_resid`^2),
           sum(ptest$`3_resid`^2)), col.names = "Sum of Squared Residuals")
Above are the sums of the squared residuals for each model on the test set. The first model has the lowest value. However, none of the models has a particularly good fit: even the best one explains only about half of the variance in price on the training data. Thus, a linear model may not be the best choice in attempting to predict price. I will still take a look at the first model to see what relationships exist between price and the other variables.
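A sum of squared residuals grows with the number of test rows, so it can be easier to interpret as a root mean squared error (RMSE), which is in the same units as the outcome (dollars per night for price). The sketch below shows the calculation on R's built-in mtcars data rather than the Airbnb data. The model and split are purely illustrative assumptions.

```r
set.seed(100)

# Illustrative split of the built-in mtcars data: 24 training rows, 8 test rows
idx <- sample(seq_len(nrow(mtcars)), size = 24)
train <- mtcars[idx, ]
test <- mtcars[-idx, ]

# Fit on the training rows, predict on the held-out rows
fit <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)
resid <- test$mpg - pred

# Sum of squared residuals and its per-observation counterpart
ssr <- sum(resid^2)
rmse <- sqrt(mean(resid^2))
print(c(ssr = ssr, rmse = rmse))
```

The two quantities rank models identically on the same test set, since RMSE is just the square root of SSR divided by the number of test rows.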
summary(price_lm_1)
##
## Call:
## lm(formula = price ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -757.18 -51.87 -13.30 30.51 751.13
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -59576.070 29045.197 -2.051
## neighbourhood_cleansedBernal Heights 5.004 14.275 0.351
## neighbourhood_cleansedHaight Ashbury -3.829 8.689 -0.441
## neighbourhood_cleansedMission 13.975 9.675 1.444
## neighbourhood_cleansedPotrero Hill 23.736 13.525 1.755
## neighbourhood_cleansedNob Hill -38.823 10.386 -3.738
## neighbourhood_cleansedMarina 7.794 12.027 0.648
## neighbourhood_cleansedDowntown/Civic Center 5.343 8.696 0.614
## neighbourhood_cleansedCastro/Upper Market 30.454 8.753 3.479
## neighbourhood_cleansedInner Sunset -21.247 13.384 -1.587
## neighbourhood_cleansedSouth of Market 10.800 9.470 1.140
## neighbourhood_cleansedNoe Valley 40.644 11.707 3.472
## neighbourhood_cleansedPacific Heights 19.162 10.927 1.754
## neighbourhood_cleansedPresidio Heights -10.918 20.534 -0.532
## neighbourhood_cleansedGlen Park 25.189 19.073 1.321
## neighbourhood_cleansedTwin Peaks 41.603 16.357 2.543
## neighbourhood_cleansedOcean View -0.405 21.746 -0.019
## neighbourhood_cleansedFinancial District 40.325 14.048 2.871
## neighbourhood_cleansedOuter Richmond -47.637 16.658 -2.860
## neighbourhood_cleansedRussian Hill 7.680 12.616 0.609
## neighbourhood_cleansedOuter Sunset -28.895 15.986 -1.807
## neighbourhood_cleansedNorth Beach -23.265 13.593 -1.711
## neighbourhood_cleansedInner Richmond -37.777 11.309 -3.340
## neighbourhood_cleansedExcelsior -13.396 19.664 -0.681
## neighbourhood_cleansedSeacliff -86.969 28.326 -3.070
## neighbourhood_cleansedChinatown 18.582 13.451 1.381
## neighbourhood_cleansedWest of Twin Peaks 23.744 17.581 1.351
## neighbourhood_cleansedBayview -2.601 21.056 -0.124
## neighbourhood_cleansedDiamond Heights 32.365 28.987 1.117
## neighbourhood_cleansedOuter Mission 14.628 18.952 0.772
## neighbourhood_cleansedParkside -25.496 18.773 -1.358
## neighbourhood_cleansedGolden Gate Park 13.588 42.129 0.323
## neighbourhood_cleansedLakeshore -2.232 23.480 -0.095
## neighbourhood_cleansedCrocker Amazon 3.034 28.472 0.107
## neighbourhood_cleansedVisitacion Valley 6.404 25.249 0.254
## neighbourhood_cleansedPresidio -77.806 99.929 -0.779
## room_typePrivate room -39.839 3.225 -12.353
## room_typeShared room -129.413 9.719 -13.316
## accommodates 20.644 1.238 16.678
## bathrooms 12.254 2.090 5.863
## bedrooms 53.083 2.562 20.721
## latitude 1303.506 304.086 4.287
## longitude -84.964 209.045 -0.406
## Pr(>|t|)
## (Intercept) 0.040301 *
## neighbourhood_cleansedBernal Heights 0.725922
## neighbourhood_cleansedHaight Ashbury 0.659458
## neighbourhood_cleansedMission 0.148669
## neighbourhood_cleansedPotrero Hill 0.079309 .
## neighbourhood_cleansedNob Hill 0.000187 ***
## neighbourhood_cleansedMarina 0.516987
## neighbourhood_cleansedDowntown/Civic Center 0.538998
## neighbourhood_cleansedCastro/Upper Market 0.000507 ***
## neighbourhood_cleansedInner Sunset 0.112474
## neighbourhood_cleansedSouth of Market 0.254161
## neighbourhood_cleansedNoe Valley 0.000521 ***
## neighbourhood_cleansedPacific Heights 0.079541 .
## neighbourhood_cleansedPresidio Heights 0.594948
## neighbourhood_cleansedGlen Park 0.186675
## neighbourhood_cleansedTwin Peaks 0.011005 *
## neighbourhood_cleansedOcean View 0.985143
## neighbourhood_cleansedFinancial District 0.004114 **
## neighbourhood_cleansedOuter Richmond 0.004257 **
## neighbourhood_cleansedRussian Hill 0.542722
## neighbourhood_cleansedOuter Sunset 0.070746 .
## neighbourhood_cleansedNorth Beach 0.087054 .
## neighbourhood_cleansedInner Richmond 0.000842 ***
## neighbourhood_cleansedExcelsior 0.495724
## neighbourhood_cleansedSeacliff 0.002150 **
## neighbourhood_cleansedChinatown 0.167221
## neighbourhood_cleansedWest of Twin Peaks 0.176885
## neighbourhood_cleansedBayview 0.901693
## neighbourhood_cleansedDiamond Heights 0.264249
## neighbourhood_cleansedOuter Mission 0.440234
## neighbourhood_cleansedParkside 0.174490
## neighbourhood_cleansedGolden Gate Park 0.747055
## neighbourhood_cleansedLakeshore 0.924269
## neighbourhood_cleansedCrocker Amazon 0.915146
## neighbourhood_cleansedVisitacion Valley 0.799806
## neighbourhood_cleansedPresidio 0.436241
## room_typePrivate room < 2e-16 ***
## room_typeShared room < 2e-16 ***
## accommodates < 2e-16 ***
## bathrooms 4.81e-09 ***
## bedrooms < 2e-16 ***
## latitude 1.85e-05 ***
## longitude 0.684435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 98.83 on 5315 degrees of freedom
## Multiple R-squared: 0.5049, Adjusted R-squared: 0.501
## F-statistic: 129 on 42 and 5315 DF, p-value: < 2.2e-16
The model has converted each categorical variable into a set of dummy variables, one per level other than the baseline. It looks like not many of the neighbourhood_cleansed dummy variables are significant. Interestingly, latitude has a much larger and more significant coefficient than longitude.
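To make the dummy coding concrete, the sketch below uses model.matrix() on a small invented factor to show the columns lm() builds behind the scenes. The first factor level becomes the baseline and gets no column of its own, so each dummy coefficient is a difference from that baseline. In the summary above, the level missing from the coefficient list (apparently Entire home/apt for room_type and Western Addition for neighbourhood_cleansed) plays that role.

```r
# Toy factor with the three room types (rows are made up)
toy <- data.frame(
  room_type = factor(c("Entire home/apt", "Private room",
                       "Shared room", "Private room"))
)

# Design matrix: an intercept plus one 0/1 column per non-baseline level
X <- model.matrix(~ room_type, data = toy)
print(colnames(X))
print(X)
```

Read this way, a coefficient such as the one on room_typePrivate room is the estimated price difference between a private room and an entire home, holding the other variables fixed.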
Reviews can reveal information about each listing that cannot be found in quantitative data such as the number of bedrooms or even amenities.
# Import datasets
sentiments <- read_csv("data/processed/sentiments.csv")
listings <- read_csv("data/processed/listings.csv")
The cleaned sentiments dataset is composed of word and sentiment for all of the words in each review. The reviews are identified by the reviewer_id together with the listing_id. Each review spans many rows, one per word.
sentiments %>% head() %>% kable()
listing_id | date | reviewer_id | word | sentiment |
---|---|---|---|---|
205842 | 2012-07-22 | 2936612 | excellent | positive |
205842 | 2012-07-22 | 2936612 | charming | positive |
205842 | 2012-07-22 | 2936612 | sparkling | positive |
205842 | 2012-07-22 | 2936612 | clean | positive |
205842 | 2012-07-22 | 2936612 | well | positive |
205842 | 2012-07-22 | 2936612 | easy | positive |
It would be interesting to see what the most frequent positive and negative words in the reviews are.
# top positive words
positive <- sentiments %>%
  filter(sentiment == "positive") %>%
  count(word, sentiment, sort = TRUE) %>%
  select(-sentiment)
# top negative words
negative <- sentiments %>%
  filter(sentiment == "negative") %>%
  count(word, sentiment, sort = TRUE) %>%
  select(-sentiment)
kable(list(head(positive, 10), head(negative, 10)), align='c')
The most frequent positive words are on the left, and the most frequent negative words are on the right of the table above. In my opinion, these words are categorized correctly and make sense. In the positive column, it seems that clean and comfortable listings receive positive reviews. In the negative column, it appears that listings that are loud and cold receive negative reviews.
I will now count the number of positive and negative words, then calculate the sentiment score per review by subtracting the number of negative words from the number of positive words.
# sentiments per review
listing_sent <- sentiments %>%
  # group by reviewer
  group_by(reviewer_id) %>%
  # count positive and negative sentiments
  count(listing_id, sentiment) %>%
  # make new positive and negative count columns
  spread(key = sentiment, value = n, fill = 0) %>%
  # calculate overall sentiment per review
  mutate(sent = positive - negative)
listing_sent %>% head() %>% kable()
reviewer_id | listing_id | negative | positive | sent |
---|---|---|---|---|
1 | 288213 | 0 | 8 | 8 |
1 | 1855096 | 1 | 10 | 9 |
1 | 2933105 | 0 | 11 | 11 |
3 | 9225 | 0 | 4 | 4 |
3 | 12522 | 0 | 8 | 8 |
3 | 14125 | 2 | 5 | 3 |
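The scoring rule can be checked by hand on a tiny example. The sketch below is a base R illustration on invented rows, not the pipeline used above: it treats each reviewer and listing pair as one review, counts positive and negative words, and takes the difference. The names toy, review_key, counts, and sent are hypothetical.

```r
# Toy word-level sentiments: two reviews of listing 10 (values are made up)
toy <- data.frame(
  reviewer_id = c(1, 1, 1, 2, 2),
  listing_id = c(10, 10, 10, 10, 10),
  sentiment = c("positive", "positive", "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

# Fix the levels so both columns exist even if one sentiment never occurs
toy$sentiment <- factor(toy$sentiment, levels = c("negative", "positive"))

# One row per review, one column per sentiment
review_key <- paste(toy$reviewer_id, toy$listing_id, sep = "_")
counts <- table(review_key, toy$sentiment)

# Score per review: positive words minus negative words
sent <- counts[, "positive"] - counts[, "negative"]
print(sent)

# Average score across the reviews of this listing
avg_sent <- mean(sent)
print(avg_sent)
```

Because the score is a raw count difference, longer reviews can reach larger absolute scores than short ones, which is worth keeping in mind when comparing listings.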
Then, I’ll find the average sentiment per listing and join the dataset with the original listings.
# average sentiments per listing
listing_sent <- listing_sent %>%
  # group by listing
  group_by(listing_id) %>%
  # average sentiment per listing
  summarize(avg_sent = mean(sent)) %>%
  # join with original dataset
  left_join(listings, by = c("listing_id" = "id"))
listing_sent %>%
  arrange(desc(avg_sent)) %>%
  select(listing_id:number_of_reviews, review_scores_rating) %>%
  head(10) %>%
  kable()
listing_id | avg_sent | price | number_of_reviews | review_scores_rating |
---|---|---|---|---|
4250927 | 9.968254 | 239 | 63 | 100 |
271505 | 9.777108 | 160 | 182 | 99 |
413663 | 9.661765 | 189 | 72 | 99 |
5325355 | 9.490196 | 200 | 73 | 99 |
2026910 | 9.445545 | 550 | 107 | 99 |
715754 | 9.421053 | 160 | 78 | 99 |
856123 | 9.315789 | 168 | 62 | 100 |
377452 | 8.880952 | 200 | 85 | 99 |
27025 | 8.857143 | 175 | 129 | 100 |
734839 | 8.759036 | 185 | 172 | 99 |
Above are the top 10 listings with the best reviews based on sentiment analysis. The review_scores_rating is the average score given by reviewers. The scores seem to match the sentiments of the reviews well. A better way to see if there is a trend would be to plot the data.
listing_sent %>%
  ggplot(aes(x = as.factor(review_scores_rating), y = avg_sent)) +
  geom_boxplot() +
  labs(title = "Average Sentiment vs Review Scores",
       x = "Review Scores", y = "Average Sentiment")
There is a strong positive relationship between the average sentiment and review scores. This suggests that the sentiment analysis was successful.
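A rank correlation is one way to put a number on a relationship like this. The sketch below uses invented values, not the actual listing_sent data, and assumes that some listings may have a missing rating. Spearman's method is chosen here only because review scores are bounded and bunched near the top, which is my assumption about the data.

```r
# Toy listings: rating and average sentiment (values are made up)
toy <- data.frame(
  review_scores_rating = c(100, 99, 98, 95, 90, NA),
  avg_sent = c(9.5, 9.0, 8.0, 6.5, 4.0, 7.0)
)

# Spearman rank correlation, skipping rows with a missing rating
rho <- cor(toy$review_scores_rating, toy$avg_sent,
           method = "spearman", use = "complete.obs")
print(rho)
```

In this toy example the two columns are perfectly monotonic, so the coefficient is 1. Real data would give a smaller value.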
Intuitively, I would expect more positive reviews to be correlated with higher prices.
listing_sent %>%
  ggplot(aes(x = price, y = avg_sent)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  labs(title = "Average Sentiment Score vs Price",
       x = "Price", y = "Average Sentiment Score")
Above is a plot of price on the x-axis and avg_sent, the average sentiment score, on the y-axis. It seems that there is a positive relationship between them. It might help to take a closer look at lower prices, where the points are more concentrated.
listing_sent %>%
  filter(price <= 250) %>%
  ggplot(aes(x = price, y = avg_sent)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  labs(title = "Average Sentiment Score vs Price",
       x = "Price", y = "Average Sentiment Score")
Above is the same plot as before, for listings priced at or below $250. The linear trend is a lot stronger now, indicating that there is a positive relationship between price and reviews.
I have explored data on Airbnb listings and examined multiple variables that affect price. The ones that seem to affect it the most are room_type and bedrooms. Surprisingly, there is not a clear pattern in the location variables. This may be because the variables in the data are not the best for capturing location. Unfortunately, a linear model does not fit the data well, with an R² of only about 0.5.
I have also examined the sentiments of reviews for the listings. The sentiment analysis matched the scores given by the reviewers well.