Overview

Airbnb is a global online hospitality service that connects travellers with lodging from local homeowners. The website and mobile application provide a marketplace for individuals to book or offer rooms. With listings in numerous cities around the globe, Airbnb holds massive amounts of data on thousands of listings per region. San Francisco, in particular, is where Airbnb was founded and remains the location of its headquarters. It is a burgeoning city undergoing an economic boom driven by the technology industry, so it would be interesting to observe rental housing prices under these circumstances.

Data

The data used for this analysis is obtained from Inside Airbnb, a website that provides data scraped from public listings on the Airbnb website. I will be analyzing San Francisco data that was compiled on October 03, 2018. In particular, I will be using the “Detailed Listings data for San Francisco” with a file name of “listings.csv.gz”, and “Detailed Review Data for listings in San Francisco” with a file name of “reviews.csv.gz”. These datasets contain information about the same listings, so I will join them together by the variable listing_id.

  1. Murray Cox, 2018, “Detailed Listings data for San Francisco”, Inside Airbnb, http://insideairbnb.com/get-the-data.html
  2. Murray Cox, 2018, “Detailed Review Data for listings in San Francisco”, Inside Airbnb, http://insideairbnb.com/get-the-data.html
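The join described above can be sketched in base R as follows. This is only an illustration on invented toy data: the data frame names and values are hypothetical, and the key names (`id` in the listings file, `listing_id` in the reviews file) are assumed from the join used later in this analysis.

```r
# Toy stand-ins for listings.csv.gz and reviews.csv.gz (values are invented)
listings_toy <- data.frame(
  id = c(101, 102, 103),
  price = c(170, 235, 65)
)
reviews_toy <- data.frame(
  listing_id = c(101, 101, 103),
  comments = c("Great stay", "Very clean", "A bit noisy"),
  stringsAsFactors = FALSE
)

# Keep every review and attach the columns of the matching listing
joined <- merge(reviews_toy, listings_toy,
                by.x = "listing_id", by.y = "id", all.x = TRUE)
print(joined)
```

Listing 102 has no reviews in the toy data, so it does not appear in the result; using `all.y = TRUE` instead would keep it with missing review fields.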

Goals

The primary goal of this analysis is to identify the most important factors that affect housing prices, and attempt to predict prices based on these variables. Additionally, I would like to perform sentiment analysis on the reviews, and provide summary statistics of reviews for each listing.

Analysis

Housing Prices

Load Packages

# Load packages
library(tidyverse)
library(modelr)
library(caret)
library(ggmap)
library(broom)
library(knitr)

Prices Dataset

# Import datasets
prices <- read_csv(file = ("data/processed/prices.csv"), col_types = cols("n", col_factor(NULL), col_factor(NULL), "n", "n", "n", "c", "n", "n"))

After importing prices, a processed dataset that contains listing prices and several important factors that may go into daily rates, I will first take a look at the composition of the data.

prices %>% glimpse()
## Observations: 6,695
## Variables: 9
## $ price                  <dbl> 170, 235, 65, 65, 785, 255, 139, 135, 2...
## $ neighbourhood_cleansed <fct> Western Addition, Bernal Heights, Haigh...
## $ room_type              <fct> Entire home/apt, Entire home/apt, Priva...
## $ accommodates           <dbl> 3, 5, 2, 2, 5, 6, 3, 2, 6, 2, 4, 3, 5, ...
## $ bathrooms              <dbl> 1.0, 1.0, 4.0, 4.0, 1.5, 1.0, 1.0, 1.0,...
## $ bedrooms               <dbl> 1, 2, 1, 1, 2, 2, 1, 1, 2, 0, 3, 3, 3, ...
## $ amenities              <chr> "{TV,\"Cable TV\",Internet,Wifi,Kitchen...
## $ latitude               <dbl> 37.76931, 37.74511, 37.76669, 37.76487,...
## $ longitude              <dbl> -122.4339, -122.4210, -122.4525, -122.4...

prices contains 9 columns:

  • price: daily base rate for booking the room
  • neighbourhood_cleansed: neighborhood of listing (36 total)
  • room_type: type of room (3 total)
  • accommodates: number of individuals that can be accommodated
  • bathrooms: number of bathrooms
  • bedrooms: number of bedrooms
  • amenities: amenities available
  • latitude, longitude: the latitude and longitude of the listing

I will examine each variable and see whether there are any interesting relationships.

Location

Since individuals generally choose a hotel based on where they are travelling, I believe that neighbourhood_cleansed, longitude, and latitude are very influential on price. The more popular tourist areas are most likely more expensive.

I sorted the 36 neighborhoods in San Francisco based on their average housing price into 6 groups. Below are the groupings for reference.

# Find avg price per neighborhood
prices_neigh <- prices %>% 
  group_by(neighbourhood_cleansed) %>% 
  mutate(avg_price = round(mean(price), 2))

# Label neighborhoods by price range
prices_neigh[["neigh_cat"]] <- cut(prices_neigh$avg_price, 6, labels = c("Cheapest", "Cheap", "Moderately Cheap", "Moderately Expensive", "Expensive", "Most Expensive"))

# Neighborhood groups
prices_neigh %>% 
  select(neighbourhood_cleansed, neigh_cat, avg_price) %>% 
  group_by(neighbourhood_cleansed) %>% 
  unique() %>% 
  arrange(neigh_cat, avg_price)


The average price per neighborhood ranges from as low as $105 per night for the cheapest neighborhoods to $287.83 per night for the most expensive.
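One detail worth keeping in mind is that cut() with a number of bins splits the range of values into equal-width intervals, not into groups of equal size. The short sketch below illustrates this with two bins and made-up averages (the values are hypothetical, not taken from the data).

```r
# Hypothetical neighbourhood averages (not taken from the data)
avg_price <- c(100, 110, 120, 130, 400)

# cut() with a bin count splits the range into equal-width intervals,
# so one high value can leave most neighbourhoods in the lowest bin
groups <- cut(avg_price, 2, labels = c("Cheap", "Expensive"))
print(table(groups))
```

With real data this means the six price groups may contain very different numbers of neighborhoods.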

Now, let’s look at price for each neighbourhood_cleansed and accommodates pair to see what an Airbnb generally costs for different numbers of people.

prices_neigh %>% 
  filter(accommodates <= 6) %>% 
  ggplot(aes(x = as.factor(accommodates), y = fct_reorder(neighbourhood_cleansed, avg_price))) + 
  geom_tile(aes(fill = price)) + 
  scale_fill_gradientn(colors = c("#c1e8ff", "#002b42")) + 
  labs(title = "Price vs Neighborhood and Accommodates", 
       x = "Accommodates", y = "Neighborhood")


The plot above shows neighbourhood_cleansed, in ascending order of average price, against accommodates, with color representing price. In general, the color gets darker as accommodates increases and as the neighborhoods become more expensive. This indicates that Airbnb listings in more expensive tourist areas and for more people tend to be more expensive. However, the trend is not very clear.
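Part of the noise may come from the fact that many listings share the same neighborhood and accommodates pair, so several tiles are drawn on top of each other and only one price is visible per cell. One possible refinement, sketched here in base R on invented data (the column names and values are hypothetical), is to average the price within each pair before plotting.

```r
# Invented listings: several rows share the same neighbourhood/accommodates pair
toy <- data.frame(
  neighbourhood = c("A", "A", "A", "B", "B"),
  accommodates  = c(2, 2, 4, 2, 2),
  price         = c(100, 140, 200, 80, 120)
)

# One mean price per (neighbourhood, accommodates) cell
tile_means <- aggregate(price ~ neighbourhood + accommodates,
                        data = toy, FUN = mean)
print(tile_means)
```

The aggregated data frame has exactly one row per cell, which is the shape a heatmap expects.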

To visualize the individual listings, I’ll overlay the listings on a map of San Francisco and color the points by price.

sf <- get_stamenmap(bbox = c(left = -122.5164, bottom = 37.7066, right = -122.3554, top = 37.8103), maptype = c("toner-lines"), zoom = 13)

ggmap(sf) + 
  geom_point(data = prices, aes(x = longitude, y = latitude, color = price), alpha = 0.7) +
  scale_colour_gradient(low = '#a4faff', high = '#00275b') +
  labs(title = "Prices by Location")


Based on this map, it seems as though most of the Airbnb listings are clustered in the center of the city and towards the northeast. This makes sense, since the center region is the heart of downtown San Francisco, and the northeast is next to the bay, which has many tourist attractions.

Rooms: Number and Type

There are three types of rooms:

  1. Entire home/apt indicates that the listing is for a complete house or apartment for the guest
  2. Private room indicates that the listing is for a room within a house or apartment that may be occupied by others
  3. Shared room indicates that the room is shared with other individuals.


Intuitively, I would expect that the more private the listing, the higher the price. Additionally, I would expect a higher number of bedrooms to be more expensive.

prices %>% 
  filter(bedrooms <= 5) %>% 
  group_by(room_type) %>% 
  mutate(avg_price = mean(price)) %>% 
  ggplot(aes(x = room_type, color = as.factor(bedrooms))) +
  geom_boxplot(aes(y = price)) + 
  geom_point(aes(y = avg_price, size = 1), color = "red") +
  labs(title = "Price vs Room Type by Number of Bedrooms", x = "Room Type", y = "Price", color = "Number of Bedrooms", size = "Average Price")


The plot above supports both hypotheses. The red points indicate the average price for each room_type, and they show a clear decreasing pattern from entire homes to private rooms to shared rooms. The boxplots, colored by number of bedrooms, show that within each room_type the price increases as bedrooms increases.

I also have data on the number of bathrooms. Whole bathrooms that include a bath or shower are counted as one, while half bathrooms, which have only a toilet and a sink, are counted as 0.5. It seems likely that bathrooms depends on bedrooms, so its correlation with price would come through bedrooms rather than from bathrooms itself.

prices %>% 
  filter(bedrooms <= 5, bathrooms <= 5) %>% 
  ggplot(aes(x = as.factor(bathrooms), y = price)) +
  geom_boxplot() + 
  facet_wrap(~bedrooms) +
  labs(title = "Price vs Number of Bathrooms per Number of Bedrooms", x = "Number of Bathrooms", y = "Price")

Looking at the plot above, which displays price vs bathrooms faceted by bedrooms, there does not seem to be a consistent relationship between bathrooms and price within each number of bedrooms. Only at 4 bedrooms is there a slightly positive correlation between bathrooms and price. For the other bedroom counts there is no positive correlation, so I conclude that bathrooms has little effect on price beyond what bedrooms already explains.
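The visual impression could be backed up by computing the correlation between bathrooms and price separately for each number of bedrooms. Below is a minimal base R sketch of that check on invented data; the values are hypothetical and only the column names mirror the prices dataset.

```r
# Invented listings with the same column names as the prices dataset
toy <- data.frame(
  bedrooms  = c(1, 1, 1, 1, 2, 2, 2, 2),
  bathrooms = c(1, 1, 1.5, 2, 1, 1.5, 2, 2.5),
  price     = c(100, 120, 110, 115, 180, 175, 190, 185)
)

# Correlation between bathrooms and price within each bedroom count
cor_by_bedrooms <- sapply(
  split(toy, toy$bedrooms),
  function(d) cor(d$bathrooms, d$price)
)
print(round(cor_by_bedrooms, 2))
```

Values near zero within each group would support the reading that bathrooms adds little once bedrooms is fixed.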

Number of People

I will now look at the number of individuals accommodated by one listing and view the relationship with price. I expect a positive relationship, since generally, more individuals would require a larger space and more expensive house.

prices %>% 
  ggplot(aes(x = as.factor(accommodates), y = price)) +
  geom_boxplot(aes(color = room_type)) +
  geom_smooth(aes(x = accommodates), se = FALSE, color = "purple") + 
  labs(title = "Price vs Number Accommodated", x = "Number Accommodated", y = "Price", color = "Room Type")


The boxplots above show price for each number of people accommodated, with a separate boxplot for each room_type. There is a positive relationship until accommodates reaches 9 people, after which the trend becomes erratic. This may be because there are few listings for large groups, so individual listings can easily sway the trend. The same relationship between room_type and price appears here as well.
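Whether the larger group sizes really are backed by only a handful of listings can be checked by counting listings per value of accommodates. The sketch below shows the idea in base R on a made-up vector (the values and the threshold of 2 are hypothetical).

```r
# Invented accommodates values; large groups are deliberately rare
accommodates_toy <- c(2, 2, 2, 4, 4, 6, 10, 16)

# Number of listings for each group size
counts <- table(accommodates_toy)
print(counts)

# Group sizes backed by fewer than 2 listings in this toy example
sparse <- names(counts)[as.vector(counts) < 2]
print(sparse)
```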

Amenities

To work with amenities, which is stored as a single string per listing, I will first split each string into its individual amenities and store the result back in prices as a list column.

# Convert amenities from char to list
amen <- prices$amenities
amen_list <- vector("list", length(amen))
for (i in seq_along(amen)){
  # remove punctuation and extra characters
  amen_list[[i]] = str_remove_all(amen[[i]], pattern = '[^([A-Z][a-z][0-9]|,|\'|\\s)]') %>%
    # split by comma
    str_split(pattern = ",")
}
prices[["amenities"]] <- amen_list

Now each Airbnb listing has a list of amenities. Next, I want to look at the 15 most frequently listed amenities.
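For reference, the same kind of split can be done without a loop. The sketch below is a simplified base R alternative, not the code used above: it only strips braces and double quotes, and the sample strings are modelled on the truncated value shown by glimpse(), with the closing part assumed.

```r
# Sample strings modelled on the amenities column (the endings are assumed)
amenities_raw <- c(
  "{TV,\"Cable TV\",Internet,Wifi,Kitchen}",
  "{Wifi,Heating}"
)

# Drop the braces and double quotes, then split each string on commas
cleaned <- gsub("[{}\"]", "", amenities_raw)
amenities_list <- strsplit(cleaned, ",", fixed = TRUE)
print(amenities_list[[1]])
```

Because gsub() and strsplit() are vectorised, the whole column is processed in one call. An amenity name that itself contains a comma would be split in two by this approach.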

# Count frequency
amen_freq <- prices$amenities %>% 
  unlist() %>% 
  table()

# Remove unwanted rows
amen_freq <- amen_freq[-c(1, 168, 169)]

# Top 15
amen_freq %>%
  sort(decreasing = TRUE) %>% 
  head(15) %>% kable(col.names = c("Amenities", "Freq"), align='c')
Amenities                   Freq
Wifi                        6607
Essentials                  6488
Heating                     6312
Smoke detector              6241
Hangers                     5995
Hair dryer                  5688
Kitchen                     5661
Shampoo                     5630
Iron                        5484
Laptop friendly workspace   5411
Carbon monoxide detector    5384
TV                          5288
Washer                      4736
Dryer                       4728
Fire extinguisher           4614

Many of these amenities seem reasonable, as they are common in hotels. I would expect “Wifi”, “Shampoo”, and “TV”, since these are fairly standard in the lodging industry. More surprising are “Smoke detector”, “Carbon monoxide detector”, and “Fire extinguisher”, which I would not have expected to be listed as amenities. It makes sense, though, since they are legally required to be installed in every house or apartment in San Francisco. They may not top the list simply because some hosts do not consider them amenities and so have not listed them on Airbnb. If every host that provides them did list them, I would expect them to be the most common amenities.

It would be interesting to see how amenities differ between expensive and cheap Airbnb listings.

# Split into two groups at the median price
amen_prices <- prices %>% 
  mutate(above_median = price > median(price))

# Amenities per group
expensive_amenities <- amen_prices %>% 
  filter(above_median) %>% 
  select(amenities) %>% 
  unlist() %>% 
  table() %>% 
  sort(decreasing = TRUE)

cheap_amenities <- amen_prices %>% 
  filter(!above_median) %>% 
  select(amenities) %>% 
  unlist() %>% 
  table() %>% 
  sort(decreasing = TRUE)
kable(list(head(expensive_amenities, 10), head(cheap_amenities, 10)), col.names = c("Amenities", "Freq"), align='c')
Amenities        Freq      Amenities                  Freq
Wifi             3255      Wifi                       3352
Essentials       3197      Essentials                 3291
Heating          3146      Smoke detector             3174
Smoke detector   3067      Heating                    3166
Hangers          3005      Hangers                    2990
Kitchen          2962      Hair dryer                 2767
Shampoo          2932      Kitchen                    2699
TV               2926      Shampoo                    2698
Hair dryer       2921      Carbon monoxide detector   2631
Iron             2875      Iron                       2609

The table above shows the 10 most frequently listed amenities for listings priced above the median (left) and at or below the median (right). There does not seem to be a large difference in the amenities. This may be because price depends more on quality than on which amenities are offered. Another explanation is that common amenities are provided by essentially all Airbnb hosts. Lastly, amenities may simply matter less for price than factors such as location or size.
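Raw counts can hide differences when the two groups are not the same size, so it may help to compare the share of listings that offer an amenity instead. The base R sketch below does this for “TV”. The above-median count (2926) is from the table above, and the below-median count is inferred as the overall total minus that (5288 - 2926). The two group sizes are placeholder assumptions, since the analysis does not report them.

```r
# Listings that mention TV: 2926 above the median (from the table above);
# the below-median count is inferred as the overall total minus that
tv_count <- c(above = 2926, below = 5288 - 2926)

# ASSUMPTION: placeholder group sizes that sum to the 6,695 listings;
# the real sizes depend on how many listings sit exactly at the median
group_size <- c(above = 3300, below = 3395)

# Share of listings in each group that mention TV
tv_share <- tv_count / group_size
print(round(tv_share, 2))
```

Under these assumed group sizes, TV is mentioned by a larger share of the above-median listings, which is a difference the top-10 ranking alone does not make obvious.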

Modelling

I have examined all the variables in the prices dataset, so I will move on to attempting to predict price. I will test a few different linear models and determine which one best fits prices, then apply that model to a testing dataset and see how accurate the model is.

# Model dataset
dataset <- prices %>% 
  select(-amenities)

# Train-Test Split
set.seed(100)
pindex <- createDataPartition(dataset$price, p = 0.8, list = FALSE)
ptrain <- dataset[pindex,]
ptest <- dataset[-pindex,]

The code above splits the dataset into training and test sets: 80% of the data is in ptrain, and 20% is in ptest.
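For intuition, an 80/20 split can also be written in base R with sample(). The sketch below is a plain random split on invented prices, so it is not equivalent to createDataPartition(), which for a numeric outcome also samples within percentile groups of that outcome to keep the two sets balanced.

```r
set.seed(100)

# 100 hypothetical prices standing in for the real dataset
toy <- data.frame(price = seq(50, 545, by = 5))

# Draw 80% of the row indices at random for training
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, , drop = FALSE]
test  <- toy[-train_idx, , drop = FALSE]

print(c(train = nrow(train), test = nrow(test)))
```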

# Linear models
# all data
price_lm_1 <- ptrain %>% 
  lm(price ~ ., data = .)
price_lm_1$call
## lm(formula = price ~ ., data = .)
glance(price_lm_1) %>% kable()
r.squared  adj.r.squared  sigma     statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual
0.5048863  0.5009739      98.82789  129.0455   0        43  -32192.42  64472.83  64762.63  51911352  5315
# all except lat and lon
price_lm_2 <- ptrain %>% 
  lm(price ~ neighbourhood_cleansed + room_type + accommodates
     + bathrooms + bedrooms, data = .)
price_lm_2$call
## lm(formula = price ~ neighbourhood_cleansed + room_type + accommodates + 
##     bathrooms + bedrooms, data = .)
glance(price_lm_2) %>% kable()
r.squared  adj.r.squared  sigma     statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual
0.5031746  0.499437       98.97996  134.6237   0        41  -32201.66  64487.32  64763.95  52090823  5317
# all except neighborhood
price_lm_3 <- ptrain %>% 
  lm(price ~ room_type + accommodates + bathrooms + 
       bedrooms + latitude + longitude, data = .)
price_lm_3$call
## lm(formula = price ~ room_type + accommodates + bathrooms + bedrooms + 
##     latitude + longitude, data = .)
glance(price_lm_3) %>% kable()
r.squared  adj.r.squared  sigma   statistic  p.value  df  logLik     AIC      BIC       deviance  df.residual
0.4871795  0.4865085      100.25  726.0715   0        8   -32286.55  64591.1  64650.38  53767868  5350

Above are the model formulas and their summary values. The first model has the highest R² (0.505) and adjusted R² (0.501), although the second model is nearly identical (0.503 and 0.499).
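Since the models differ in their number of predictors, adjusted R² is the fairer comparison. As a quick check, it can be recomputed from the reported R² values with the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1). In the sketch below, n is derived from the residual degrees of freedom of the first model (5315) plus its 42 predictors and the intercept; the function name adj_r2 is just an illustrative helper.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Training rows: residual df + predictors + intercept (from the first model)
n <- 5315 + 42 + 1

adj_1 <- adj_r2(0.5048863, n, p = 42)  # all variables
adj_3 <- adj_r2(0.4871795, n, p = 7)   # without neighbourhood
print(round(c(model_1 = adj_1, model_3 = adj_3), 4))
```

The recomputed values agree with the adj.r.squared column reported by glance() up to rounding.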

# Add predictions and residuals
ptest <- ptest %>% 
  add_predictions(model = price_lm_1, var = "1_pred") %>% 
  add_predictions(model = price_lm_2, var = "2_pred") %>%
  add_predictions(model = price_lm_3, var = "3_pred") %>%
  add_residuals(model = price_lm_1, var = "1_resid") %>% 
  add_residuals(model = price_lm_2, var = "2_resid") %>% 
  add_residuals(model = price_lm_3, var = "3_resid")

kable(list(sum(ptest$`1_resid`^2),
  sum(ptest$`2_resid`^2),
  sum(ptest$`3_resid`^2)), col.names = "Sum of Squared Residuals")
Model        Sum of Squared Residuals
price_lm_1   12163009
price_lm_2   12167983
price_lm_3   12311770

Above are the sums of squared residuals on the test set for each model. The first model has the lowest value. However, none of the models fits particularly well: each explains only about half of the variance in price, and the residual standard error is close to $100. Thus, a linear model may not be the best choice for predicting price. I will still take a look at the first model to see what relationships exist between price and the other variables.
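Sums of squared residuals are hard to interpret on their own, so it can help to convert them to a root mean squared error (RMSE) in dollars. The sketch below does this from the reported sums. The test set size is not stated in the analysis; it is derived here as the 6,695 listings minus the training rows implied by the first model, assuming no rows were dropped for missing values.

```r
# Training rows implied by model 1: residual df plus 43 estimated coefficients
n_train <- 5315 + 43

# ASSUMPTION: no rows were dropped, so the rest of the 6,695 listings are test rows
n_test <- 6695 - n_train

# Sums of squared residuals reported for the three models
ssr <- c(model_1 = 12163009, model_2 = 12167983, model_3 = 12311770)

# RMSE: typical size of a prediction error, in dollars per night
rmse <- sqrt(ssr / n_test)
print(round(rmse, 1))
```

Under that assumption, all three models miss by roughly $95 per night on the test set, which is similar to the residual standard error on the training data.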

summary(price_lm_1)
## 
## Call:
## lm(formula = price ~ ., data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -757.18  -51.87  -13.30   30.51  751.13 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                 -59576.070  29045.197  -2.051
## neighbourhood_cleansedBernal Heights             5.004     14.275   0.351
## neighbourhood_cleansedHaight Ashbury            -3.829      8.689  -0.441
## neighbourhood_cleansedMission                   13.975      9.675   1.444
## neighbourhood_cleansedPotrero Hill              23.736     13.525   1.755
## neighbourhood_cleansedNob Hill                 -38.823     10.386  -3.738
## neighbourhood_cleansedMarina                     7.794     12.027   0.648
## neighbourhood_cleansedDowntown/Civic Center      5.343      8.696   0.614
## neighbourhood_cleansedCastro/Upper Market       30.454      8.753   3.479
## neighbourhood_cleansedInner Sunset             -21.247     13.384  -1.587
## neighbourhood_cleansedSouth of Market           10.800      9.470   1.140
## neighbourhood_cleansedNoe Valley                40.644     11.707   3.472
## neighbourhood_cleansedPacific Heights           19.162     10.927   1.754
## neighbourhood_cleansedPresidio Heights         -10.918     20.534  -0.532
## neighbourhood_cleansedGlen Park                 25.189     19.073   1.321
## neighbourhood_cleansedTwin Peaks                41.603     16.357   2.543
## neighbourhood_cleansedOcean View                -0.405     21.746  -0.019
## neighbourhood_cleansedFinancial District        40.325     14.048   2.871
## neighbourhood_cleansedOuter Richmond           -47.637     16.658  -2.860
## neighbourhood_cleansedRussian Hill               7.680     12.616   0.609
## neighbourhood_cleansedOuter Sunset             -28.895     15.986  -1.807
## neighbourhood_cleansedNorth Beach              -23.265     13.593  -1.711
## neighbourhood_cleansedInner Richmond           -37.777     11.309  -3.340
## neighbourhood_cleansedExcelsior                -13.396     19.664  -0.681
## neighbourhood_cleansedSeacliff                 -86.969     28.326  -3.070
## neighbourhood_cleansedChinatown                 18.582     13.451   1.381
## neighbourhood_cleansedWest of Twin Peaks        23.744     17.581   1.351
## neighbourhood_cleansedBayview                   -2.601     21.056  -0.124
## neighbourhood_cleansedDiamond Heights           32.365     28.987   1.117
## neighbourhood_cleansedOuter Mission             14.628     18.952   0.772
## neighbourhood_cleansedParkside                 -25.496     18.773  -1.358
## neighbourhood_cleansedGolden Gate Park          13.588     42.129   0.323
## neighbourhood_cleansedLakeshore                 -2.232     23.480  -0.095
## neighbourhood_cleansedCrocker Amazon             3.034     28.472   0.107
## neighbourhood_cleansedVisitacion Valley          6.404     25.249   0.254
## neighbourhood_cleansedPresidio                 -77.806     99.929  -0.779
## room_typePrivate room                          -39.839      3.225 -12.353
## room_typeShared room                          -129.413      9.719 -13.316
## accommodates                                    20.644      1.238  16.678
## bathrooms                                       12.254      2.090   5.863
## bedrooms                                        53.083      2.562  20.721
## latitude                                      1303.506    304.086   4.287
## longitude                                      -84.964    209.045  -0.406
##                                             Pr(>|t|)    
## (Intercept)                                 0.040301 *  
## neighbourhood_cleansedBernal Heights        0.725922    
## neighbourhood_cleansedHaight Ashbury        0.659458    
## neighbourhood_cleansedMission               0.148669    
## neighbourhood_cleansedPotrero Hill          0.079309 .  
## neighbourhood_cleansedNob Hill              0.000187 ***
## neighbourhood_cleansedMarina                0.516987    
## neighbourhood_cleansedDowntown/Civic Center 0.538998    
## neighbourhood_cleansedCastro/Upper Market   0.000507 ***
## neighbourhood_cleansedInner Sunset          0.112474    
## neighbourhood_cleansedSouth of Market       0.254161    
## neighbourhood_cleansedNoe Valley            0.000521 ***
## neighbourhood_cleansedPacific Heights       0.079541 .  
## neighbourhood_cleansedPresidio Heights      0.594948    
## neighbourhood_cleansedGlen Park             0.186675    
## neighbourhood_cleansedTwin Peaks            0.011005 *  
## neighbourhood_cleansedOcean View            0.985143    
## neighbourhood_cleansedFinancial District    0.004114 ** 
## neighbourhood_cleansedOuter Richmond        0.004257 ** 
## neighbourhood_cleansedRussian Hill          0.542722    
## neighbourhood_cleansedOuter Sunset          0.070746 .  
## neighbourhood_cleansedNorth Beach           0.087054 .  
## neighbourhood_cleansedInner Richmond        0.000842 ***
## neighbourhood_cleansedExcelsior             0.495724    
## neighbourhood_cleansedSeacliff              0.002150 ** 
## neighbourhood_cleansedChinatown             0.167221    
## neighbourhood_cleansedWest of Twin Peaks    0.176885    
## neighbourhood_cleansedBayview               0.901693    
## neighbourhood_cleansedDiamond Heights       0.264249    
## neighbourhood_cleansedOuter Mission         0.440234    
## neighbourhood_cleansedParkside              0.174490    
## neighbourhood_cleansedGolden Gate Park      0.747055    
## neighbourhood_cleansedLakeshore             0.924269    
## neighbourhood_cleansedCrocker Amazon        0.915146    
## neighbourhood_cleansedVisitacion Valley     0.799806    
## neighbourhood_cleansedPresidio              0.436241    
## room_typePrivate room                        < 2e-16 ***
## room_typeShared room                         < 2e-16 ***
## accommodates                                 < 2e-16 ***
## bathrooms                                   4.81e-09 ***
## bedrooms                                     < 2e-16 ***
## latitude                                    1.85e-05 ***
## longitude                                   0.684435    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 98.83 on 5315 degrees of freedom
## Multiple R-squared:  0.5049, Adjusted R-squared:  0.501 
## F-statistic:   129 on 42 and 5315 DF,  p-value: < 2.2e-16

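The model has encoded each categorical variable as a set of dummy variables, one for each level other than the reference level (Western Addition for neighbourhood_cleansed and Entire home/apt for room_type). Few of the neighbourhood_cleansed dummy variables are significant. Interestingly, latitude is highly significant while longitude is not. The latitude coefficient looks very large, but it is measured per degree, and the listings span only about 0.1 degrees of latitude.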
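The dummy encoding that lm() applies to a factor can be inspected directly with model.matrix(). The sketch below uses a four-row toy data frame with the three room types; the first factor level acts as the reference and gets no column of its own.

```r
# Toy data with the three room types; the first level is the reference
toy <- data.frame(
  room_type = factor(
    c("Entire home/apt", "Private room", "Shared room", "Private room"),
    levels = c("Entire home/apt", "Private room", "Shared room")
  )
)

# Design matrix: an intercept plus one 0/1 column per non-reference level
dummies <- model.matrix(~ room_type, data = toy)
print(dummies)
```

Each coefficient on a dummy column is therefore a price difference relative to the reference level, holding the other variables fixed.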

Reviews

Reviews can reveal information about each listing that cannot be found in quantitative data such as the number of bedrooms or even amenities.

Sentiments Dataset

# Import datasets
sentiments <- read_csv("data/processed/sentiments.csv")
listings <- read_csv("data/processed/listings.csv")

The cleaned sentiments dataset contains one row per word, with its sentiment, for the words in each review. A review is identified by its listing_id and reviewer_id, and each review spans many rows.

sentiments %>% head() %>% kable()
listing_id date reviewer_id word sentiment
205842 2012-07-22 2936612 excellent positive
205842 2012-07-22 2936612 charming positive
205842 2012-07-22 2936612 sparkling positive
205842 2012-07-22 2936612 clean positive
205842 2012-07-22 2936612 well positive
205842 2012-07-22 2936612 easy positive

Top Words

It would be interesting to see what the most frequent positive and negative words in the reviews are.

# top positive words
positive <- sentiments %>% 
  filter(sentiment == "positive") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  select(-sentiment)

# top negative words
negative <- sentiments %>% 
  filter(sentiment == "negative") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  select(-sentiment)

kable(list(head(positive, 10), head(negative, 10)), align='c')
word          n         word      n
great         149096    die       5350
clean         71224     noise     4450
nice          56583     problem   4284
comfortable   50816     issue     2794
recommend     42847     issues    2149
easy          39017     hard      1968
perfect       37257     bad       1957
well          35207     cold      1810
good          34121     noisy     1691
quiet         30475     tout      1670

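The most frequent positive words are on the left, and the most frequent negative words are on the right of the table above. In my opinion, most of these words are categorized correctly and make sense. Two likely exceptions are “die” and “tout”, which may come from non-English reviews (for example, the German article “die” and the French word “tout”) rather than expressing negative sentiment. In the positive column, it seems that clean and comfortable listings receive positive reviews. In the negative column, it appears that listings that are loud or cold receive negative reviews.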

Sentiment Scores

I will now count the number of positive and negative words in each review, then calculate the sentiment score per review by subtracting the number of negative words from the number of positive words.

# sentiments per review
listing_sent <- sentiments %>% 
  # group by reviewer
  group_by(reviewer_id) %>% 
  # count positive and negative sentiments
  count(listing_id, sentiment) %>% 
  # make new positive and negative count columns
  spread(key = sentiment, value = n, fill = 0) %>%  
  # calculate overall sentiment per review
  mutate(sent = positive - negative)

listing_sent %>% head() %>% kable()
reviewer_id listing_id negative positive sent
1 288213 0 8 8
1 1855096 1 10 9
1 2933105 0 11 11
3 9225 0 4 4
3 12522 0 8 8
3 14125 2 5 3
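To make the calculation concrete, here is the same idea on a tiny invented dataset, written in base R rather than with the tidyverse pipeline above. The ids and words are hypothetical; each row stands for one sentiment-bearing word in a review.

```r
# Invented word-level data: one row per sentiment word in a review
toy <- data.frame(
  listing_id  = c(1, 1, 1, 1, 2, 2),
  reviewer_id = c(10, 10, 10, 11, 12, 12),
  sentiment   = c("positive", "positive", "negative",
                  "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Count positive and negative words per (listing, reviewer) pair
counts <- as.data.frame.matrix(
  table(paste(toy$listing_id, toy$reviewer_id, sep = "_"), toy$sentiment)
)

# Sentiment score: positive words minus negative words
counts$sent <- counts$positive - counts$negative
print(counts)
```

A review with two positive words and one negative word scores 1, while a review with only negative words gets a negative score.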

Then, I’ll find the average sentiment per listing and join the result with the original listings dataset.

# average sentiments per listing
listing_sent <- listing_sent %>% 
  # group by listing
  group_by(listing_id) %>% 
  # average sentiment per listing
  summarize(avg_sent = mean(sent)) %>%
  # join with original dataset
  left_join(listings, by = c("listing_id" = "id"))

listing_sent %>% 
  arrange(desc(avg_sent)) %>% 
  select(listing_id:number_of_reviews, review_scores_rating) %>% 
  head(10) %>% kable()
listing_id avg_sent price number_of_reviews review_scores_rating
4250927 9.968254 239 63 100
271505 9.777108 160 182 99
413663 9.661765 189 72 99
5325355 9.490196 200 73 99
2026910 9.445545 550 107 99
715754 9.421053 160 78 99
856123 9.315789 168 62 100
377452 8.880952 200 85 99
27025 8.857143 175 129 100
734839 8.759036 185 172 99

Above are the top 10 listings with the best reviews based on sentiment analysis. The review_scores_rating is the average score given by reviewers. The scores seem to match the sentiments of the reviews well. A better way to see if there is a trend would be to plot the data.

listing_sent %>% 
  ggplot(aes(x = as.factor(review_scores_rating), y = avg_sent)) +
  geom_boxplot() +
  labs(title = "Average Sentiment vs Review Scores",
       x = "Review Scores", y = "Average Sentiment")


There is a strong positive relationship between the average sentiment and review scores. This suggests that the sentiment scores capture much the same information as the ratings given by reviewers.
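The strength of this relationship could also be summarised with a single number. One option, sketched below in base R on invented values, is a Spearman rank correlation, which only assumes that the two quantities rise together and tolerates the skew in review scores. The column names mirror the joined dataset, but the numbers are hypothetical.

```r
# Invented listings; the last one has no review score
toy <- data.frame(
  review_scores_rating = c(80, 90, 95, 98, 100, NA),
  avg_sent = c(2.1, 3.5, 4.8, 5.2, 6.0, 4.0)
)

# Rank correlation, ignoring listings with a missing score
rho <- cor(toy$review_scores_rating, toy$avg_sent,
           use = "complete.obs", method = "spearman")
print(rho)
```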

Price

Intuitively, I would expect more positive reviews to be correlated with higher prices.

listing_sent %>% 
  ggplot(aes(x = price, y = avg_sent)) + 
  geom_point() + 
  geom_smooth(se = FALSE, method = "lm") + 
  labs(title = "Average Sentiment Score vs Price",
       x = "Price", y = "Average Sentiment Score")


Above is a plot of price on the x-axis and avg_sent, the average sentiment score, on the y-axis. There seems to be a positive relationship between them. It may help to zoom in on the lower prices, where most of the points are concentrated.

listing_sent %>% 
  filter(price <= 250) %>% 
  ggplot(aes(x = price, y = avg_sent)) + 
  geom_point() + 
  geom_smooth(se = FALSE, method = "lm") + 
  labs(title = "Average Sentiment Score vs Price",
       x = "Price", y = "Average Sentiment Score")


Above is the same plot as before, restricted to listings priced at $250 or less. The linear trend is a lot stronger now, indicating that there is a positive relationship between price and reviews.

Conclusion

I have explored data on Airbnb listings and examined multiple variables that affect price. The ones that seem to affect it the most are room_type and bedrooms. Surprisingly, there is not a clear pattern in the location variables. This may be because the variables in the data do not capture location well. Unfortunately, a linear model does not fit the data well, explaining only about half of the variance in price.

I have also examined the sentiments of reviews for the listings. The sentiment analysis matched the scores given by the reviewers well.