Analysis

Housing Prices

Load Packages

# Load packages
library(tidyverse)
library(modelr)
library(caret)
library(ggmap)
library(broom)
library(knitr)

Prices Dataset

# Import datasets
prices <- read_csv(file = ("data/processed/prices.csv"), col_types = cols("n", col_factor(NULL), col_factor(NULL), "n", "n", "n", "c", "n", "n"))

After importing prices, a processed dataset that contains listing prices and several important factors that may go into daily rates, I will first take a look at the composition of the data.

prices %>% glimpse()

## Observations: 6,695
## Variables: 9
## $ price                  <dbl> 170, 235, 65, 65, 785, 255, 139, 135, 2...
## $ neighbourhood_cleansed <fct> Western Addition, Bernal Heights, Haigh...
## $ room_type              <fct> Entire home/apt, Entire home/apt, Priva...
## $ accommodates           <dbl> 3, 5, 2, 2, 5, 6, 3, 2, 6, 2, 4, 3, 5, ...
## $ bathrooms              <dbl> 1.0, 1.0, 4.0, 4.0, 1.5, 1.0, 1.0, 1.0,...
## $ bedrooms               <dbl> 1, 2, 1, 1, 2, 2, 1, 1, 2, 0, 3, 3, 3, ...
## $ amenities              <chr> "{TV,\"Cable TV\",Internet,Wifi,Kitchen...
## $ latitude               <dbl> 37.76931, 37.74511, 37.76669, 37.76487,...
## $ longitude              <dbl> -122.4339, -122.4210, -122.4525, -122.4...

prices contains 9 columns:

price: daily base rate for booking the room
neighbourhood_cleansed: neighborhood of listing (36 total)
room_type: type of room (3 total)
accommodates: number of individuals that can be accommodated
bathrooms: number of bathrooms
bedrooms: number of bedrooms
amenities: amenities available
latitude, longitude: the latitude and longitude of the listing

I will examine each variable and see whether there are any interesting relationships.

Location

Since individuals generally choose a hotel based on where they are travelling, I believe that neighbourhood_cleansed, longitude, and latitude are very influential on price. The more popular tourist areas are most likely more expensive.

I sorted the 36 neighborhoods in San Francisco based on their average housing price into 6 groups. Below are the groupings for reference.

# Find avg price per neighborhood
prices_neigh <- prices %>% 
  group_by(neighbourhood_cleansed) %>% 
  mutate(avg_price = round(mean(price), 2))

# Label neighborhoods by price range
prices_neigh[["neigh_cat"]] <- cut(prices_neigh$avg_price, 6, labels = c("Cheapest", "Cheap", "Moderately Cheap", "Moderately Expensive", "Expensive", "Most Expensive"))

# Neighborhood groups
prices_neigh %>% 
  select(neighbourhood_cleansed, neigh_cat, avg_price) %>% 
  group_by(neighbourhood_cleansed) %>% 
  unique() %>% 
  arrange(neigh_cat, avg_price)

The average price per neighborhood ranges from as low as $105 per night for the cheapest neighborhoods to $287.83 per night for the most expensive.
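A note on the grouping step: as far as I understand it, calling cut() with a number of breaks splits the range of values into equal-width intervals, so the six groups need not contain the same number of neighborhoods. The sketch below uses made-up average prices (not values from prices) to illustrate this.

```r
# Made-up neighborhood averages, for illustration only
avg_price <- c(105, 120, 130, 150, 180, 200, 287.83)

labels <- c("Cheapest", "Cheap", "Moderately Cheap",
            "Moderately Expensive", "Expensive", "Most Expensive")

# cut(x, 6) divides the range of x into 6 intervals of equal width
groups <- cut(avg_price, 6, labels = labels)

# Group sizes are uneven, and a group can even be empty
group_sizes <- table(groups)
print(group_sizes)
```

If equally sized groups were preferred, quantile-based breaks would be one alternative, but that is a different grouping from the one used in this analysis.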

Now, let’s look at price for each neighbourhood_cleansed and accommodates pair to see roughly what an Airbnb costs for different numbers of guests.

prices_neigh %>% 
  filter(accommodates <= 6) %>% 
  ggplot(aes(x = as.factor(accommodates), y = fct_reorder(neighbourhood_cleansed, avg_price))) + 
  geom_tile(aes(fill = price)) + 
  scale_fill_gradientn(colors = c("#c1e8ff", "#002b42")) + 
  labs(title = "Price vs Neighborhood and Accommodates", 
       x = "Accommodates", y = "Neighborhood")

The plot above shows neighbourhood_cleansed, in ascending order of average price, against accommodates, with color representing price. In general, the color gets darker as accommodates increases and as the neighborhoods become more expensive. This indicates that Airbnbs in more expensive tourist areas and for larger groups tend to cost more. However, the trend is not very clear.
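One possible reason the trend is hard to read: each neighborhood and accommodates cell can contain many listings, and I believe geom_tile() draws one tile per row, so only the last listing drawn in each cell stays visible. Below is a minimal sketch of averaging price per cell before plotting. The table `toy` and the column `avg_cell_price` are made-up names for illustration.

```r
library(dplyr)

# Made-up listings: two neighborhoods, several listings per cell
toy <- data.frame(
  neighbourhood_cleansed = c("A", "A", "A", "B", "B"),
  accommodates = c(2, 2, 4, 2, 2),
  price = c(100, 140, 200, 80, 90)
)

# One row per (neighborhood, accommodates) cell with its mean price
cell_avg <- toy %>%
  group_by(neighbourhood_cleansed, accommodates) %>%
  summarise(avg_cell_price = mean(price), n = n(), .groups = "drop")

print(cell_avg)
```

The summarized table could then be passed to the same geom_tile() call with fill = avg_cell_price, so that each tile reflects all listings in its cell.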

To visualize the individual listings, I’ll overlay them on a map of San Francisco and color the points by price.

sf <- get_stamenmap(bbox = c(left = -122.5164, bottom = 37.7066, right = -122.3554, top = 37.8103), maptype = c("toner-lines"), zoom = 13)

ggmap(sf) + 
  geom_point(data = prices, aes(x = longitude, y = latitude, color = price), alpha = 0.7) +
  scale_colour_gradient(low = '#a4faff', high = '#00275b') +
  labs(title = "Prices by Location")

Based on this map, it seems as though most of the Airbnb listings are clustered in the center of the city and towards the northeast. This makes sense, since the center region is the heart of downtown San Francisco, and the northeast is next to the bay, which has many tourist attractions.

Rooms: Number and Type

There are three types of rooms:

Entire home/apt indicates that the listing is for a complete house or apartment for the guest
Private room indicates that the listing is for a room within a house or apartment that may be occupied by others
Shared room indicates that the room is shared with other individuals.

Intuitively, I would expect that the more private the listing, the higher the price. Additionally, I would expect a higher number of bedrooms to be more expensive.

prices %>% 
  filter(bedrooms <= 5) %>% 
  group_by(room_type) %>% 
  mutate(avg_price = mean(price)) %>% 
  ggplot(aes(x = room_type, color = as.factor(bedrooms))) +
  geom_boxplot(aes(y = price)) + 
  geom_point(aes(y = avg_price, size = 1), color = "red") +
  labs(title = "Price vs Room Type by Number of Bedrooms", x = "Room Type", y = "Price", color = "Number of Bedrooms", size = "Average Price")

The plot above confirms my hypotheses. The red points indicate the average price for each room_type, and they show a clear decreasing pattern from entire homes to private rooms to shared rooms. The boxplots, colored by number of bedrooms, show that within each room_type the price increases as bedrooms increases.

I also have data on the number of bathrooms. Whole bathrooms that include a bath or shower are counted as one, while half bathrooms have only a toilet and a sink. It seems likely that bathrooms depends on bedrooms, so its correlation with price would come from bedrooms rather than from bathrooms itself.

prices %>% 
  filter(bedrooms <= 5, bathrooms <= 5) %>% 
  ggplot(aes(x = as.factor(bathrooms), y = price)) +
  geom_boxplot() + 
  facet_wrap(~bedrooms) +
  labs(title = "Price vs Number of Bathrooms per Number of Bedrooms", x = "Number of Bathrooms", y = "Price")

Looking at the plot above, which displays price vs bathrooms faceted by bedrooms, there seems to be no clear relationship between bathrooms and price within each number of bedrooms. Only at 4 bedrooms is there a slightly positive correlation between bathrooms and price. For the other numbers of bedrooms, the two are not positively correlated, so I conclude that bathrooms does not visibly affect price once bedrooms is taken into account.

Number of People

I will now look at the number of individuals accommodated by one listing and view the relationship with price. I expect a positive relationship, since generally, more individuals would require a larger space and more expensive house.

prices %>% 
  ggplot(aes(x = as.factor(accommodates), y = price)) +
  geom_boxplot(aes(color = room_type)) +
  geom_smooth(aes(x = accommodates), se = FALSE, color = "purple") + 
  labs(title = "Price vs Number Accommodated", x = "Number Accommodated", y = "Price", color = "Room Type")

The boxplots above show price for each number of people accommodated, with a separate boxplot per room_type. There is a positive relationship until accommodates reaches 9 people. Beyond that point the trend breaks down, possibly because there are few observations for large listings, so a handful of prices can easily sway it. The same relationship between room_type and price is still visible.

Amenities

To work with amenities, which is stored as a single string per listing, I will first convert it to a list column and replace it in prices.

# Convert amenities from char to list
amen <- prices$amenities
amen_list <- vector("list", length(amen))
for (i in seq_along(amen)){
  # remove punctuation and extra characters
  amen_list[[i]] = str_remove_all(amen[[i]], pattern = '[^([A-Z][a-z][0-9]|,|\'|\\s)]') %>%
    # split by comma
    str_split(pattern = ",")
}
prices[["amenities"]] <- amen_list
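Since the stringr functions are vectorized, the same idea can also be written without an explicit loop. The sketch below runs on two made-up amenities strings and uses a simplified pattern that I am assuming is close enough for illustration. Unlike the pattern above, it also drops parentheses and pipe characters, so it is not a drop-in replacement.

```r
library(stringr)

# Made-up strings in the same {a,"b c",d} shape as the raw column
amen <- c('{TV,"Cable TV",Wifi}', '{Wifi,Kitchen}')

# Keep only letters, digits, commas, apostrophes, and whitespace
amen_clean <- str_remove_all(amen, "[^A-Za-z0-9,'\\s]")

# One character vector of amenities per listing
amen_list <- str_split(amen_clean, ",")

# Frequency of each amenity across all listings
amen_freq <- sort(table(unlist(amen_list)), decreasing = TRUE)
print(amen_freq)
```

Here each element of amen_list is a plain character vector rather than a nested list, but unlist() flattens both shapes the same way.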

Now each Airbnb listing has a list of amenities. Next, I want to look at the 15 most frequently listed amenities.

# Count frequency
amen_freq <- prices$amenities %>% 
  unlist() %>% 
  table()

# Remove unwanted rows
amen_freq <- amen_freq[-c(1, 168, 169)]

# Top 15
amen_freq %>%
  sort(decreasing = TRUE) %>% 
  head(15) %>% kable(col.names = c("Amenities", "Freq"), align='c')

Amenities	Freq
Wifi	6607
Essentials	6488
Heating	6312
Smoke detector	6241
Hangers	5995
Hair dryer	5688
Kitchen	5661
Shampoo	5630
Iron	5484
Laptop friendly workspace	5411
Carbon monoxide detector	5384
TV	5288
Washer	4736
Dryer	4728
Fire extinguisher	4614

Many of these amenities seem reasonable, as they are common in hotels. I would expect there to be “Wifi”, “Shampoo”, and a “TV”, since these are pretty standard for the lodging industry. Some interesting ones are “Smoke detector”, “Carbon monoxide detector”, and “Fire extinguisher”. I would not have expected those to be listed as amenities. However, it makes sense, since they are legally required to be installed in every house or apartment in San Francisco. The reason that they are not the most common amenities may be that some people do not consider them to be amenities, so they have not listed them on Airbnb. However, I expect those to be the most common amenities if they were listed by every Airbnb host that provides them.

It would be interesting to see how amenities differ between expensive and cheap Airbnb listings.

# Split into two groups by price
amen_prices <- prices %>% 
  # TRUE when the listing is priced above the median
  mutate(above_median = price > median(price))

# Amenities per group
expensive_amenities <- amen_prices %>% 
  filter(above_median) %>% 
  select(amenities) %>% 
  unlist() %>% 
  table() %>% 
  sort(decreasing = TRUE)

cheap_amenities <- amen_prices %>% 
  filter(!above_median) %>% 
  select(amenities) %>% 
  unlist() %>% 
  table() %>% 
  sort(decreasing = TRUE)

kable(list(head(expensive_amenities, 10), head(cheap_amenities, 10)), col.names = c("Amenities", "Freq"), align='c')

Amenities	Freq
Wifi	3255
Essentials	3197
Heating	3146
Smoke detector	3067
Hangers	3005
Kitchen	2962
Shampoo	2932
TV	2926
Hair dryer	2921
Iron	2875

Amenities	Freq
Wifi	3352
Essentials	3291
Smoke detector	3174
Heating	3166
Hangers	2990
Hair dryer	2767
Kitchen	2699
Shampoo	2698
Carbon monoxide detector	2631
Iron	2609

The tables above show the 10 most frequently listed amenities for listings priced above the median (left) and at or below the median (right). There does not seem to be a large difference between the two. This may be because price depends more on quality than on which amenities are offered. Another explanation is that the common amenities are provided by essentially all Airbnb hosts. Lastly, amenities may not be that important in determining price, and the more important factors may be location or size.
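Top-10 lists can hide differences in less common amenities. One way to look further, sketched here on made-up listings (all values and names below are illustrative, not from prices), is to compare the share of listings in each price group that offer each amenity.

```r
# Made-up listings: prices and the amenities each one offers
price <- c(80, 90, 150, 300)
amenities <- list(c("Wifi", "TV"), "Wifi",
                  c("Wifi", "Kitchen"), c("Wifi", "Kitchen", "TV"))

above_median <- price > median(price)   # median is 120 here

# Listings x amenities logical matrix: TRUE where a listing offers the amenity
amenity_names <- sort(unique(unlist(amenities)))
has <- sapply(amenity_names, function(a) {
  vapply(amenities, function(x) a %in% x, logical(1))
})

# Share of listings offering each amenity, per price group
share_expensive <- colMeans(has[above_median, , drop = FALSE])
share_cheap <- colMeans(has[!above_median, , drop = FALSE])

# Positive values: more common among expensive listings
diff_share <- share_expensive - share_cheap
print(sort(diff_share, decreasing = TRUE))
```

Amenities with a difference near zero are equally common in both groups, which is the pattern the top-10 tables suggest for the most popular ones.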

Modelling

I have examined all the variables in the prices dataset, so I will move on to attempting to predict price. I will test a few different linear models and determine which one best fits prices, then apply that model to a testing dataset and see how accurate the model is.

# Model dataset
dataset <- prices %>% 
  select(-amenities)

# Train-Test Split
set.seed(100)
pindex <- createDataPartition(dataset$price, p = 0.8, list = FALSE)
ptrain <- dataset[pindex,]
ptest <- dataset[-pindex,]

First, I split the dataset into train and test sets: 80% of the data is in ptrain, and 20% is in ptest.
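For readers without caret, the idea of an 80/20 split can be sketched in base R on made-up data. This is a plain random split. My understanding is that createDataPartition() additionally balances the split across ranges of the outcome, so the two approaches will not select the same rows.

```r
set.seed(100)

# Made-up data standing in for the model dataset
n <- 50
toy <- data.frame(
  price = runif(n, 50, 500),
  bedrooms = sample(0:4, n, replace = TRUE)
)

# Randomly pick 80% of the row indices for training
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```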

# Linear models
# all data
price_lm_1 <- ptrain %>% 
  lm(price ~ ., data = .)
price_lm_1$call

## lm(formula = price ~ ., data = .)

glance(price_lm_1) %>% kable()

	r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual
value	0.5048863	0.5009739	98.82789	129.0455	0	43	-32192.42	64472.83	64762.63	51911352	5315

# all except lat and lon
price_lm_2 <- ptrain %>% 
  lm(price ~ neighbourhood_cleansed + room_type + accommodates
     + bathrooms + bedrooms, data = .)
price_lm_2$call

## lm(formula = price ~ neighbourhood_cleansed + room_type + accommodates + 
##     bathrooms + bedrooms, data = .)

glance(price_lm_2) %>% kable()

	r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual
value	0.5031746	0.499437	98.97996	134.6237	0	41	-32201.66	64487.32	64763.95	52090823	5317

# all except neighborhood
price_lm_3 <- ptrain %>% 
  lm(price ~ room_type + accommodates + bathrooms + 
       bedrooms + latitude + longitude, data = .)
price_lm_3$call

## lm(formula = price ~ room_type + accommodates + bathrooms + bedrooms + 
##     latitude + longitude, data = .)

glance(price_lm_3) %>% kable()

	r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual
value	0.4871795	0.4865085	100.25	726.0715	0	8	-32286.55	64591.1	64650.38	53767868	5350

Above are the model formulas and their summary values. The first model has the highest R² (0.505) and adjusted R² (0.501), although the three models differ only slightly.
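Reading three separate summary tables side by side is awkward, so it can help to collect the key statistics into one table. Below is a minimal base R sketch of that idea. It uses the built-in mtcars data and two made-up model formulas, because the point is only the comparison pattern, not these particular models.

```r
# Two nested example models on a built-in dataset
models <- list(
  full = lm(mpg ~ wt + hp + qsec, data = mtcars),
  no_qsec = lm(mpg ~ wt + hp, data = mtcars)
)

# One row per model with the statistics compared above
comparison <- data.frame(
  model = names(models),
  r.squared = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC = sapply(models, AIC),
  BIC = sapply(models, BIC),
  row.names = NULL
)
print(comparison)
```

Plain R² can never decrease when predictors are added, which is one reason to read adjusted R², AIC, and BIC alongside it.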

# Add predictions and residuals
ptest <- ptest %>% 
  add_predictions(model = price_lm_1, var = "1_pred") %>% 
  add_predictions(model = price_lm_2, var = "2_pred") %>%
  add_predictions(model = price_lm_3, var = "3_pred") %>%
  add_residuals(model = price_lm_1, var = "1_resid") %>% 
  add_residuals(model = price_lm_2, var = "2_resid") %>% 
  add_residuals(model = price_lm_3, var = "3_resid")

kable(list(sum(ptest$`1_resid`^2),
  sum(ptest$`2_resid`^2),
  sum(ptest$`3_resid`^2)), col.names = "Sum of Squared Residuals")

Sum of Squared Residuals
12163009

Sum of Squared Residuals
12167983

Sum of Squared Residuals
12311770

Above are the sums of the squared residuals for each model on the test set. The first model has the lowest value. However, none of the models fits particularly well: even the best one explains only about half of the variance in price on the training data (R² of about 0.50), with a residual standard error of about $99. Thus, a linear model may not be the best choice for predicting price. I will still take a look at the first model to see what relationships exist between price and the other variables.
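A sum of squared residuals is hard to interpret on its own, so it can help to convert it to a root mean squared error, which is in dollars per night. The arithmetic below assumes the test set holds the 6,695 listings minus the training rows, where the training size is inferred from the first model's 5,315 residual degrees of freedom plus its 43 coefficients, and that no test rows were dropped when predicting.

```r
# Sizes inferred from the output above (see assumptions in the text)
n_total <- 6695
n_train <- 5315 + 43   # residual df + estimated coefficients of price_lm_1
n_test <- n_total - n_train

# Test-set sums of squared residuals reported above
ssr <- c(model_1 = 12163009, model_2 = 12167983, model_3 = 12311770)

# Root mean squared error: typical size of a prediction error, in dollars
rmse <- sqrt(ssr / n_test)
print(round(rmse, 2))
```

Under these assumptions all three models miss by roughly $95 to $96 per night on the test set, which is in line with the residual standard error on the training data.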

summary(price_lm_1)

## 
## Call:
## lm(formula = price ~ ., data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -757.18  -51.87  -13.30   30.51  751.13 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                 -59576.070  29045.197  -2.051
## neighbourhood_cleansedBernal Heights             5.004     14.275   0.351
## neighbourhood_cleansedHaight Ashbury            -3.829      8.689  -0.441
## neighbourhood_cleansedMission                   13.975      9.675   1.444
## neighbourhood_cleansedPotrero Hill              23.736     13.525   1.755
## neighbourhood_cleansedNob Hill                 -38.823     10.386  -3.738
## neighbourhood_cleansedMarina                     7.794     12.027   0.648
## neighbourhood_cleansedDowntown/Civic Center      5.343      8.696   0.614
## neighbourhood_cleansedCastro/Upper Market       30.454      8.753   3.479
## neighbourhood_cleansedInner Sunset             -21.247     13.384  -1.587
## neighbourhood_cleansedSouth of Market           10.800      9.470   1.140
## neighbourhood_cleansedNoe Valley                40.644     11.707   3.472
## neighbourhood_cleansedPacific Heights           19.162     10.927   1.754
## neighbourhood_cleansedPresidio Heights         -10.918     20.534  -0.532
## neighbourhood_cleansedGlen Park                 25.189     19.073   1.321
## neighbourhood_cleansedTwin Peaks                41.603     16.357   2.543
## neighbourhood_cleansedOcean View                -0.405     21.746  -0.019
## neighbourhood_cleansedFinancial District        40.325     14.048   2.871
## neighbourhood_cleansedOuter Richmond           -47.637     16.658  -2.860
## neighbourhood_cleansedRussian Hill               7.680     12.616   0.609
## neighbourhood_cleansedOuter Sunset             -28.895     15.986  -1.807
## neighbourhood_cleansedNorth Beach              -23.265     13.593  -1.711
## neighbourhood_cleansedInner Richmond           -37.777     11.309  -3.340
## neighbourhood_cleansedExcelsior                -13.396     19.664  -0.681
## neighbourhood_cleansedSeacliff                 -86.969     28.326  -3.070
## neighbourhood_cleansedChinatown                 18.582     13.451   1.381
## neighbourhood_cleansedWest of Twin Peaks        23.744     17.581   1.351
## neighbourhood_cleansedBayview                   -2.601     21.056  -0.124
## neighbourhood_cleansedDiamond Heights           32.365     28.987   1.117
## neighbourhood_cleansedOuter Mission             14.628     18.952   0.772
## neighbourhood_cleansedParkside                 -25.496     18.773  -1.358
## neighbourhood_cleansedGolden Gate Park          13.588     42.129   0.323
## neighbourhood_cleansedLakeshore                 -2.232     23.480  -0.095
## neighbourhood_cleansedCrocker Amazon             3.034     28.472   0.107
## neighbourhood_cleansedVisitacion Valley          6.404     25.249   0.254
## neighbourhood_cleansedPresidio                 -77.806     99.929  -0.779
## room_typePrivate room                          -39.839      3.225 -12.353
## room_typeShared room                          -129.413      9.719 -13.316
## accommodates                                    20.644      1.238  16.678
## bathrooms                                       12.254      2.090   5.863
## bedrooms                                        53.083      2.562  20.721
## latitude                                      1303.506    304.086   4.287
## longitude                                      -84.964    209.045  -0.406
##                                             Pr(>|t|)    
## (Intercept)                                 0.040301 *  
## neighbourhood_cleansedBernal Heights        0.725922    
## neighbourhood_cleansedHaight Ashbury        0.659458    
## neighbourhood_cleansedMission               0.148669    
## neighbourhood_cleansedPotrero Hill          0.079309 .  
## neighbourhood_cleansedNob Hill              0.000187 ***
## neighbourhood_cleansedMarina                0.516987    
## neighbourhood_cleansedDowntown/Civic Center 0.538998    
## neighbourhood_cleansedCastro/Upper Market   0.000507 ***
## neighbourhood_cleansedInner Sunset          0.112474    
## neighbourhood_cleansedSouth of Market       0.254161    
## neighbourhood_cleansedNoe Valley            0.000521 ***
## neighbourhood_cleansedPacific Heights       0.079541 .  
## neighbourhood_cleansedPresidio Heights      0.594948    
## neighbourhood_cleansedGlen Park             0.186675    
## neighbourhood_cleansedTwin Peaks            0.011005 *  
## neighbourhood_cleansedOcean View            0.985143    
## neighbourhood_cleansedFinancial District    0.004114 ** 
## neighbourhood_cleansedOuter Richmond        0.004257 ** 
## neighbourhood_cleansedRussian Hill          0.542722    
## neighbourhood_cleansedOuter Sunset          0.070746 .  
## neighbourhood_cleansedNorth Beach           0.087054 .  
## neighbourhood_cleansedInner Richmond        0.000842 ***
## neighbourhood_cleansedExcelsior             0.495724    
## neighbourhood_cleansedSeacliff              0.002150 ** 
## neighbourhood_cleansedChinatown             0.167221    
## neighbourhood_cleansedWest of Twin Peaks    0.176885    
## neighbourhood_cleansedBayview               0.901693    
## neighbourhood_cleansedDiamond Heights       0.264249    
## neighbourhood_cleansedOuter Mission         0.440234    
## neighbourhood_cleansedParkside              0.174490    
## neighbourhood_cleansedGolden Gate Park      0.747055    
## neighbourhood_cleansedLakeshore             0.924269    
## neighbourhood_cleansedCrocker Amazon        0.915146    
## neighbourhood_cleansedVisitacion Valley     0.799806    
## neighbourhood_cleansedPresidio              0.436241    
## room_typePrivate room                        < 2e-16 ***
## room_typeShared room                         < 2e-16 ***
## accommodates                                 < 2e-16 ***
## bathrooms                                   4.81e-09 ***
## bedrooms                                     < 2e-16 ***
## latitude                                    1.85e-05 ***
## longitude                                   0.684435    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 98.83 on 5315 degrees of freedom
## Multiple R-squared:  0.5049, Adjusted R-squared:  0.501 
## F-statistic:   129 on 42 and 5315 DF,  p-value: < 2.2e-16

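The model has encoded each categorical variable as a set of dummy variables, one for every level except the baseline. Not many of the neighbourhood_cleansed dummy variables are significant: only 8 of the 35 have p-values below 0.05. Interestingly, latitude has a much larger and statistically significant effect (p < 0.001), while longitude is not significant (p = 0.68).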
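To make the dummy-variable encoding concrete, here is a small base R sketch using model.matrix() on a made-up room_type vector. The first factor level acts as the baseline and gets no column of its own, which is why the summary above has no row for Entire home/apt.

```r
# Made-up room types for four listings
room_type <- factor(
  c("Entire home/apt", "Private room", "Shared room", "Private room"),
  levels = c("Entire home/apt", "Private room", "Shared room")
)

# Design matrix that lm() would build for this predictor
X <- model.matrix(~ room_type)
print(X)
```

Each coefficient on a dummy column is then read as the price difference relative to the baseline level, holding the other predictors fixed.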

Reviews

Reviews can reveal information about each listing that cannot be found in quantitative data such as the number of bedrooms or even amenities.

Sentiments Dataset

# Import datasets
sentiments <- read_csv("data/processed/sentiments.csv")
listings <- read_csv("data/processed/listings.csv")

The cleaned sentiments dataset contains a word and its sentiment for the words in each review. Reviews are identified by reviewer_id, and each review spans many rows.

sentiments %>% head() %>% kable()

listing_id	date	reviewer_id	word	sentiment
205842	2012-07-22	2936612	excellent	positive
205842	2012-07-22	2936612	charming	positive
205842	2012-07-22	2936612	sparkling	positive
205842	2012-07-22	2936612	clean	positive
205842	2012-07-22	2936612	well	positive
205842	2012-07-22	2936612	easy	positive

Top Words

It would be interesting to see what the most frequent positive and negative words in the reviews are.

# top positive words
positive <- sentiments %>% 
  filter(sentiment == "positive") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  select(-sentiment)

# top negative words
negative <- sentiments %>% 
  filter(sentiment == "negative") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  select(-sentiment)

kable(list(head(positive, 10), head(negative, 10)), align='c')

word	n
great	149096
clean	71224
nice	56583
comfortable	50816
recommend	42847
easy	39017
perfect	37257
well	35207
good	34121
quiet	30475

word	n
die	5350
noise	4450
problem	4284
issue	2794
issues	2149
hard	1968
bad	1957
cold	1810
noisy	1691
tout	1670

The most frequent positive words are on the left of the table above, and the most frequent negative words are on the right. For the most part, these words seem to be categorized correctly and make sense, although “die” and “tout” may come from non-English reviews (they are common words in German and French) rather than from genuinely negative comments. In the positive column, it seems that clean and comfortable Airbnbs receive positive reviews. In the negative column, it appears that Airbnbs that are loud and cold receive negative reviews.

Sentiment Scores

I will now count the number of positive and negative words, then calculate the sentiment score per review by subtracting the number of negative words from the number of positive words.

# sentiments per review
listing_sent <- sentiments %>% 
  # group by reviewer
  group_by(reviewer_id) %>% 
  # count positive and negative sentiments
  count(listing_id, sentiment) %>% 
  # make new positive and negative count columns
  spread(key = sentiment, value = n, fill = 0) %>%  
  # calculate overall sentiment per review
  mutate(sent = positive - negative)

listing_sent %>% head() %>% kable()

reviewer_id	listing_id	negative	positive	sent
1	288213	0	8	8
1	1855096	1	10	9
1	2933105	0	11	11
3	9225	0	4	4
3	12522	0	8	8
3	14125	2	5	3
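The scoring step can be traced by hand on a tiny made-up table. The sketch below uses pivot_wider() from tidyr, which I am using as the newer equivalent of spread(). The data and the name `review_sent` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Made-up word-level sentiments: three reviews across two listings
toy <- data.frame(
  listing_id = c(1, 1, 1, 1, 2, 2),
  reviewer_id = c(10, 10, 10, 11, 12, 12),
  sentiment = c("positive", "positive", "negative",
                "positive", "negative", "positive")
)

review_sent <- toy %>%
  # number of positive and negative words per review
  count(listing_id, reviewer_id, sentiment) %>%
  # one column per sentiment, 0 where a review has no such words
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # score = positive words minus negative words
  mutate(sent = positive - negative)

print(review_sent)
```

For example, reviewer 10 used two positive words and one negative word, giving a score of 1.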

Then, I’ll find the average sentiment per listing and join the dataset with the original listings.

# average sentiments per listing
listing_sent <- listing_sent %>% 
  # group by listing
  group_by(listing_id) %>% 
  # average sentiment per listing
  summarize(avg_sent = mean(sent)) %>%
  # join with original dataset
  left_join(listings, by = c("listing_id" = "id"))

listing_sent %>% 
  arrange(desc(avg_sent)) %>% 
  select(listing_id:number_of_reviews, review_scores_rating) %>% 
  head(10) %>% kable()

listing_id	avg_sent	price	number_of_reviews	review_scores_rating
4250927	9.968254	239	63	100
271505	9.777108	160	182	99
413663	9.661765	189	72	99
5325355	9.490196	200	73	99
2026910	9.445545	550	107	99
715754	9.421053	160	78	99
856123	9.315789	168	62	100
377452	8.880952	200	85	99
27025	8.857143	175	129	100
734839	8.759036	185	172	99

Above are the top 10 listings with the best reviews based on sentiment analysis. The review_scores_rating is the average score given by reviewers. The scores seem to match the sentiments of the reviews well. A better way to see if there is a trend would be to plot the data.

listing_sent %>% 
  ggplot(aes(x = as.factor(review_scores_rating), y = avg_sent)) +
  geom_boxplot() +
  labs(title = "Average Sentiment vs Review Scores",
       x = "Review Scores", y = "Average Sentiment")

There is a strong positive relationship between average sentiment and review scores, which suggests that the sentiment analysis worked as intended.
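The strength of a relationship like this can also be summarized with a rank correlation, which does not assume the relationship is linear. Below is a minimal sketch with made-up numbers (not values from listing_sent).

```r
# Made-up review scores and average sentiments for five listings
review_scores_rating <- c(90, 95, 98, 99, 100)
avg_sent <- c(3, 4, 6, 5, 8)

# Spearman correlation: agreement between the two rankings
rho <- cor(review_scores_rating, avg_sent, method = "spearman")
print(rho)
```

A value close to 1 means that listings ranked higher on one measure tend to rank higher on the other as well.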

Price

Intuitively, I would expect more positive reviews to be correlated with higher prices.

listing_sent %>% 
  ggplot(aes(x = price, y = avg_sent)) + 
  geom_point() + 
  geom_smooth(se = FALSE, method = "lm") + 
  labs(title = "Average Sentiment Score vs Price",
       x = "Price", y = "Average Sentiment Score")

Above is a plot of price on the x-axis and avg_sent, the average sentiment score, on the y-axis. There seems to be a positive relationship between them. It might help to look more closely at the lower prices, where the points are most concentrated.

listing_sent %>% 
  filter(price <= 250) %>% 
  ggplot(aes(x = price, y = avg_sent)) + 
  geom_point() + 
  geom_smooth(se = FALSE, method = "lm") + 
  labs(title = "Average Sentiment Score vs Price",
       x = "Price", y = "Average Sentiment Score")

Above is the same plot as before, restricted to listings priced at $250 or less. The linear trend is a lot stronger now, indicating that there is a positive relationship between price and reviews.

San Francisco Airbnb Analysis

Amy Chen

December 9, 2018

Overview

Data

Goals