import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import (LabelEncoder,
StandardScaler)from sklearn.model_selection import (train_test_split,
cross_val_score,
GridSearchCV,
RandomizedSearchCV, BaseCrossValidator)from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_squared_error,
r2_score)
from hyperopt import hp, fmin, tpe, rand, STATUS_OK, Trials
Libraries
Libraries setup
%matplotlib inline
"ggplot")
plt.style.use("display.max_columns", None)
pd.set_option("display.max_colwidth", None) pd.set_option(
1. Research Question Definition
1.1 Data Analysis Question
We will be performing hyperparameter tuning techniques to the most accurate model in an effort to achieve optimal predictions.
1.2 Metric For Success
This will be a regression task, We will use the regression metrics to determine how the model works:
- \(R^2\) Score
- Mean Absolute Error
- Residual Sum of Squares
1.3 The Context
Build a solution the would make optimal predictions of rental prices for the city of Amsterdam.
1.4 Experimental Design
- Loading the dataset
- Exploring the dataset
- Data manipulation, data cleaning and visualization
- Data modeling
- Hyperparameter tuning
- Conclusion and recommendation
1.5 Data Relevance
The Data Provided in relevant to the research question
2. Data Cleaning & Data Preparation
2.1 Loading and Preview Datasets
= pd.read_csv("datasets/raw/listing_summary.csv")
rentals rentals.head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2818 | Quiet Garden View Room & Super Fast WiFi | 3159 | Daniel | NaN | Oostelijk Havengebied - Indische Buurt | 52.36575 | 4.94142 | Private room | 59 | 3 | 278 | 2020-02-14 | 2.06 | 1 | 169 |
1 | 20168 | Studio with private bathroom in the centre 1 | 59484 | Alexander | NaN | Centrum-Oost | 52.36509 | 4.89354 | Private room | 100 | 1 | 340 | 2020-04-09 | 2.76 | 2 | 106 |
2 | 25428 | Lovely apt in City Centre (w.lift) near Jordaan | 56142 | Joan | NaN | Centrum-West | 52.37297 | 4.88339 | Entire home/apt | 125 | 14 | 5 | 2020-02-09 | 0.18 | 1 | 132 |
3 | 27886 | Romantic, stylish B&B houseboat in canal district | 97647 | Flip | NaN | Centrum-West | 52.38761 | 4.89188 | Private room | 155 | 2 | 217 | 2020-03-02 | 2.15 | 1 | 172 |
4 | 28871 | Comfortable double room | 124245 | Edwin | NaN | Centrum-West | 52.36719 | 4.89092 | Private room | 75 | 2 | 332 | 2020-03-16 | 2.82 | 3 | 210 |
Dataset Glossary
Before starting the analysis, let’s load the glossary to understand the column descriptions
with open("datasets/raw/Glossary - Sheet1 (1).csv", encoding='utf8') as file:
print(file.read())
room_id: A unique number identifying an Airbnb listing.
host_id: A unique number identifying an Airbnb host.
neighborhood: A subregion of the city or search area for which the survey is carried out. For some cities there is no neighbourhood information.
"room_type: One of “Entire home/apt”, “Private room”, or “Shared room”."
host_response_rate: The rate at which the particular host responds to the customers.
"price: The price (in $US) for a night stay. In early surveys, there may be some values that were recorded by month."
accomodates: The number of guests a listing can accommodate.
bathrooms: The number of bathrooms a listing offers.
bedrooms: The number of bedrooms a listing offers.
beds: The number of beds a listing offers.
"minimum_nights: The minimum stay for a visit, as posted by the host."
"maximum nights: The maximum stay for a visit, as posted by the host."
overall_satisfaction: The average rating (out of five) that the listing has received from those visitors who left a review.
"number_of_reviews: The number of reviews that a listing has received. Airbnb has said that 70% of visits end up with a review, so the number of reviews can be used to estimate the number of visits. Note that such an estimate will not be reliable for an individual listing (especially as reviews occasionally vanish from the site), but over a city as a whole it should be a useful metric of traffic."
reviews_per_month: The number of reviews that a listing has received per month.
host_listings_count: The number of listings for a particular host.
availability_365: The number of days for which a particular host is available in a year.
Data Exploration
The name
column contains the basic information about the Airbnb room, since this column won’t be of any help in our analysis or modeling, we will drop the column
=['name'], inplace=True)
rentals.drop(columns rentals.shape
(19362, 15)
The dataset has over 19362 rentals with 15 rentals information. Each row represents an Airbnb listing
rentals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19362 entries, 0 to 19361
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 19362 non-null int64
1 host_id 19362 non-null int64
2 host_name 19358 non-null object
3 neighbourhood_group 0 non-null float64
4 neighbourhood 19362 non-null object
5 latitude 19362 non-null float64
6 longitude 19362 non-null float64
7 room_type 19362 non-null object
8 price 19362 non-null int64
9 minimum_nights 19362 non-null int64
10 number_of_reviews 19362 non-null int64
11 last_review 17078 non-null object
12 reviews_per_month 17078 non-null float64
13 calculated_host_listings_count 19362 non-null int64
14 availability_365 19362 non-null int64
dtypes: float64(4), int64(7), object(4)
memory usage: 2.2+ MB
neighbourhood_group
19362 missing values, that means every row in this column is a missing value. last_review
and reviews_per_month
have the same number of missing values.
Handling Duplicates and Missing Values
We will start by - droping any duplicates if any, - remove the neighbourhood_group
- drop records with missing values for last_review
and reviews_per_month
=True)
rentals.drop_duplicates(inplace rentals.shape
(19362, 15)
There were no duplicates in the dataset. Let’s continue by checking missing values and dropping them
sum() rentals.isnull().
id 0
host_id 0
host_name 4
neighbourhood_group 19362
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 2284
reviews_per_month 2284
calculated_host_listings_count 0
availability_365 0
dtype: int64
=['neighbourhood_group'], inplace=True)
rentals.drop(columns=True)
rentals.dropna(inplacesum() rentals.isnull().
id 0
host_id 0
host_name 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 0
reviews_per_month 0
calculated_host_listings_count 0
availability_365 0
dtype: int64
rentals.shape
(17075, 14)
After removing the missing values, the dataset now has 17075 Airbnb listings
Checking for anomalies
To ensure there are no anomalies in the dataset, we will check the number of unique items in every column
= rentals.columns.to_list()
cols
for col in cols:
print(f"{col:>34} No. of Unique Items:{rentals[col].nunique():>8}")
id No. of Unique Items: 17075
host_id No. of Unique Items: 15182
host_name No. of Unique Items: 5382
neighbourhood No. of Unique Items: 22
latitude No. of Unique Items: 5720
longitude No. of Unique Items: 9193
room_type No. of Unique Items: 4
price No. of Unique Items: 422
minimum_nights No. of Unique Items: 60
number_of_reviews No. of Unique Items: 422
last_review No. of Unique Items: 1590
reviews_per_month No. of Unique Items: 703
calculated_host_listings_count No. of Unique Items: 23
availability_365 No. of Unique Items: 366
Outliers visualization
=(20,10))
plt.figure(figsize
rentals.boxplot() plt.show()
Records with outliers
= rentals.quantile(.25)
q1 = rentals.quantile(.75)
q3 = q3 - q1
iqr = rentals[
outliers_df < (q1 - 1.5*iqr)) | (rentals > (q3 + 1.5*iqr))).any(axis=1)
((rentals
] outliers_df.shape
(8544, 14)
There are Over 8500 outliers in the dataset
5) outliers_df.sample(
id | host_id | host_name | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12117 | 22620913 | 2888051 | Lokke | Noord-West | 52.41524 | 4.89218 | Entire home/apt | 80 | 4 | 2 | 2018-05-06 | 0.08 | 1 | 0 |
3511 | 7166128 | 16634603 | Hermen | Centrum-Oost | 52.36991 | 4.92621 | Entire home/apt | 110 | 5 | 7 | 2018-08-19 | 0.12 | 1 | 0 |
15707 | 32722735 | 246045083 | Babette | De Baarsjes - Oud-West | 52.36514 | 4.86184 | Entire home/apt | 120 | 2 | 8 | 2019-06-29 | 0.64 | 1 | 0 |
13425 | 26037649 | 193680275 | Merle | Westerpark | 52.38374 | 4.87314 | Entire home/apt | 125 | 5 | 5 | 2018-11-18 | 0.24 | 1 | 0 |
730 | 1277443 | 6952882 | Jason & Lotte | Bos en Lommer | 52.38129 | 4.85746 | Entire home/apt | 98 | 5 | 4 | 2014-07-07 | 0.06 | 1 | 0 |
Let’s check the percentage of outliers to the original dataset
print(f"Percentage of outliers: {outliers_df.shape[0]/rentals.shape[0]*100:.2f}")
Percentage of outliers: 50.04
It would be tricky to drop the records with outliers since that will reduce our dataset by half so we will leave them there. However, we will drop the host_id variable later on, right before modeling.
3. Data Analysis
3.1 Univariate Data Analysis
Top 10 Most common hosts
=(20,12))
plt.figure(figsize10].plot(kind='bar')
rentals.host_name.value_counts()[:"Top 10 Hosts")
plt.title("Host Name")
plt.xlabel("Number of Listings")
plt.ylabel( plt.show()
Martijn is the host name with most listings in the Airbnb rentals listing.
the top 10 most common neighbourhoods
= (20, 12))
plt.figure(figsize 10].sort_values().plot(kind = 'barh')
rentals.neighbourhood.value_counts()[:= "right")
plt.xticks(ha "Top 10 Neighbourhoods")
plt.title("Number of Listings")
plt.xlabel("Neighbourhood")
plt.ylabel( plt.show()
De Baarsjes - Oud-West Had the most number of rental listings
the most common room types
rentals.room_type.value_counts()
Entire home/apt 13308
Private room 3497
Hotel room 232
Shared room 38
Name: room_type, dtype: int64
=(8,8))
plt.figure(figsize='pie', autopct="%0.1f%%", labels=rentals.room_type.value_counts().index)
rentals.room_type.value_counts().plot(kind
plt.legend()"the most common room types")
plt.title( plt.show()
distribution of price
= (14,8))
plt.figure(figsize
sns.distplot(rentals.price)"Distribution of $price$")
plt.title( plt.show()
The distribution of price is skewed to the right and is not continous.
the top 10 most common minimum number of nights to spend
=(20,10))
plt.figure(figsize10].sort_values().plot(kind='barh')
rentals.minimum_nights.value_counts()[:"the top 10 most common minimum number of nights to spend")
plt.title("Number of listings")
plt.xlabel("Minimum Nights")
plt.ylabel( plt.show()
The most common minimum number of nights to spend is 2.
3.2 Bivariate Analysis
price by room type
'price', by='room_type', figsize=(20,12))
rentals.hist( plt.show()
average price by neighbourhood
'neighbourhood')['price'].mean().sort_values(ascending=False) rentals.groupby(
neighbourhood
Centrum-West 203.653397
Centrum-Oost 190.534247
Zuid 178.813243
Oud-Noord 168.763566
De Pijp - Rivierenbuurt 166.601787
IJburg - Zeeburgereiland 157.723077
Westerpark 151.241140
De Baarsjes - Oud-West 149.376228
Oud-Oost 144.606957
Watergraafsmeer 141.199125
Buitenveldert - Zuidas 137.690355
Oostelijk Havengebied - Indische Buurt 135.101163
Noord-Oost 130.495833
Noord-West 126.318885
Bos en Lommer 122.525304
De Aker - Nieuw Sloten 121.268908
Slotervaart 119.791549
Geuzenveld - Slotermeer 113.937143
Osdorp 103.418182
Gaasperdam - Driemond 96.317757
Bijlmer-Centrum 91.434343
Bijlmer-Oost 89.696629
Name: price, dtype: float64
=(20, 10))
plt.figure(figsize'neighbourhood')['price'].mean().sort_values(ascending=False)[:10].sort_values().plot(kind='barh')
rentals.groupby("Average Neighbourhood Price")
plt.xlabel("average price by neighbourhood")
plt.title( plt.show()
Based on neighbour hood, Centrum West Has the highest average weight of listings.
average price by minimum_nights
'minimum_nights')['price'].mean().sort_values(ascending=False)[:15] rentals.groupby(
minimum_nights
365 3000.000000
200 999.000000
99 999.000000
52 429.000000
30 331.551020
222 300.000000
27 275.000000
48 250.000000
31 243.750000
150 203.000000
21 201.761905
15 197.888889
1000 185.000000
240 180.000000
28 175.900000
Name: price, dtype: float64
=(20, 10))
plt.figure(figsize'minimum_nights')['price'].mean().sort_values(ascending=False)[:10].sort_values().plot(kind='barh')
rentals.groupby("Average minimum nights Price")
plt.xlabel("average price by minimum nights")
plt.title( plt.show()
By average, those who spend 365 nights spend most price
3.3 Feature Engineering
rentals.head()
id | host_id | host_name | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2818 | 3159 | Daniel | Oostelijk Havengebied - Indische Buurt | 52.36575 | 4.94142 | Private room | 59 | 3 | 278 | 2020-02-14 | 2.06 | 1 | 169 |
1 | 20168 | 59484 | Alexander | Centrum-Oost | 52.36509 | 4.89354 | Private room | 100 | 1 | 340 | 2020-04-09 | 2.76 | 2 | 106 |
2 | 25428 | 56142 | Joan | Centrum-West | 52.37297 | 4.88339 | Entire home/apt | 125 | 14 | 5 | 2020-02-09 | 0.18 | 1 | 132 |
3 | 27886 | 97647 | Flip | Centrum-West | 52.38761 | 4.89188 | Private room | 155 | 2 | 217 | 2020-03-02 | 2.15 | 1 | 172 |
4 | 28871 | 124245 | Edwin | Centrum-West | 52.36719 | 4.89092 | Private room | 75 | 2 | 332 | 2020-03-16 | 2.82 | 3 | 210 |
'room_type_encoded'] = LabelEncoder().fit_transform(rentals.room_type)
rentals[ rentals.head()
id | host_id | host_name | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | room_type_encoded | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2818 | 3159 | Daniel | Oostelijk Havengebied - Indische Buurt | 52.36575 | 4.94142 | Private room | 59 | 3 | 278 | 2020-02-14 | 2.06 | 1 | 169 | 2 |
1 | 20168 | 59484 | Alexander | Centrum-Oost | 52.36509 | 4.89354 | Private room | 100 | 1 | 340 | 2020-04-09 | 2.76 | 2 | 106 | 2 |
2 | 25428 | 56142 | Joan | Centrum-West | 52.37297 | 4.88339 | Entire home/apt | 125 | 14 | 5 | 2020-02-09 | 0.18 | 1 | 132 | 0 |
3 | 27886 | 97647 | Flip | Centrum-West | 52.38761 | 4.89188 | Private room | 155 | 2 | 217 | 2020-03-02 | 2.15 | 1 | 172 | 2 |
4 | 28871 | 124245 | Edwin | Centrum-West | 52.36719 | 4.89092 | Private room | 75 | 2 | 332 | 2020-03-16 | 2.82 | 3 | 210 | 2 |
rentals.neighbourhood.nunique()
22
= rentals.drop(columns=['id', 'host_id', 'host_name', 'neighbourhood', 'last_review', 'room_type'])
rentals_2 rentals_2.head()
latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | room_type_encoded | |
---|---|---|---|---|---|---|---|---|---|
0 | 52.36575 | 4.94142 | 59 | 3 | 278 | 2.06 | 1 | 169 | 2 |
1 | 52.36509 | 4.89354 | 100 | 1 | 340 | 2.76 | 2 | 106 | 2 |
2 | 52.37297 | 4.88339 | 125 | 14 | 5 | 0.18 | 1 | 132 | 0 |
3 | 52.38761 | 4.89188 | 155 | 2 | 217 | 2.15 | 1 | 172 | 2 |
4 | 52.36719 | 4.89092 | 75 | 2 | 332 | 2.82 | 3 | 210 | 2 |
4. Data Modeling
= rentals_2.drop(columns=['price'])
X = rentals_2.price y
= StandardScaler().fit_transform(X)
X 5] X[:
array([[ 0.01659773, 1.45384638, -0.02542703, 4.36034888, 0.92231486,
-0.21630069, 1.18401885, 1.92103452],
[-0.02390588, 0.11006752, -0.15344314, 5.44467417, 1.44564814,
0.06606432, 0.54605829, 1.92103452],
[ 0.45968261, -0.17479788, 0.67866153, -0.4141802 , -0.4832088 ,
-0.21630069, 0.8093436 , -0.52593784],
[ 1.3581262 , 0.0634787 , -0.08943509, 3.29351271, 0.98960057,
-0.21630069, 1.21439792, 1.92103452],
[ 0.10496923, 0.03653577, -0.08943509, 5.30476123, 1.49050528,
0.34842932, 1.59919952, 1.92103452]])
= train_test_split(X, y, test_size=.3) X_train, X_test, y_train, y_test
For purposes of simplicity, we will work with the following regressors:
- Decision Tree Regressor
- Random Forest Regressor
4.1 Normal Modeling
= DecisionTreeRegressor()
dt = RandomForestRegressor() rf
dt.fit(X_train, y_train)= dt.predict(X_test)
dt_pred print(f"DT RMSE: {np.sqrt(mean_squared_error(y_test, dt_pred)):.2f}")
print(f"DT R2: {r2_score(y_test, dt_pred):.2f}")
DT RMSE: 249.36
DT R2: -5.03
rf.fit(X_train, y_train)= rf.predict(X_test)
rf_pred print(f"rf RMSE: {np.sqrt(mean_squared_error(y_test, rf_pred)):.2f}")
print(f"rf R2: {r2_score(y_test, rf_pred):.2f}")
rf RMSE: 111.97
rf R2: -0.22
int(y.shape[0]*.1)).mean() y.sample(
154.93731693028704
Both models are performing well, we will use Random Forest Regressor because it is the best perfroming model producing the best fit.
4.2 Modeling with Grid Search
rf.get_params()
{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'mse',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': False}
= {
param_grid 'bootstrap': [True, False],
'n_estimators': [100,110,120,130,140],
'max_depth': [10, 20, 30, 40, 50]
}= GridSearchCV(rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search = grid_search.fit(X_train, y_train)
grid_result print(f"Best Score: {grid_result.best_score_}\nBest Parameters: {grid_result.best_params_}")
Best Score: -0.060206490328136075
Best Parameters: {'bootstrap': True, 'max_depth': 40, 'n_estimators': 140}
We will use the generated parameters to train our model
= RandomForestRegressor(bootstrap=True, max_depth=40, n_estimators=140, n_jobs=-1)
rf_gs
rf_gs.fit(X_train, y_train)= rf_gs.predict(X_test)
rf_gs_pred print(f"RMSE: {np.sqrt(mean_squared_error(y_test, rf_gs_pred)):.2f}\nR2 Score: {r2_score(y_test, rf_gs_pred):2f}")
RMSE: 112.40
R2 Score: -0.225879
4.3 Modeling with Random Search
= RandomizedSearchCV(rf, param_distributions=param_grid, cv=5)
random_search = random_search.fit(X_train, y_train)
random_result print(f"Best Score: {random_result.best_score_}\nBest Parameters: {random_result.best_params_}")
Best Score: -0.11714015166923782
Best Parameters: {'n_estimators': 120, 'max_depth': 40, 'bootstrap': True}
= RandomForestRegressor(bootstrap=True, max_depth=40, n_estimators=140, n_jobs=-1)
rf_rs
rf_rs.fit(X_train, y_train)= rf_rs.predict(X_test)
rf_rs_pred print(f"RMSE: {np.sqrt(mean_squared_error(y_test, rf_rs_pred)):.2f}\nR2 Score: {r2_score(y_test, rf_rs_pred):2f}")
RMSE: 108.37
R2 Score: -0.139507
4.4 Modeling with Bayesian Optimization
def objective(space):
= RandomForestRegressor(
model =space['n_estimators'],
n_estimators=space['bootstrap'],
bootstrap=space['max_depth']
max_depth
)= cross_val_score(model, X_train, y_train, cv=5).mean()
accuracy return {'loss': -accuracy, 'status': STATUS_OK }
= {
param_grid 'bootstrap': hp.choice('bootstrap', [True, False]),
'n_estimators': hp.choice('n_estimators', [100,110,120,130,140]),
'max_depth': hp.quniform('max_depth', 10, 50, 10)
}= Trials()
trials = fmin(fn= objective,
best = param_grid,
space= tpe.suggest,
algo= 100,
max_evals = trials)
trials best
100%|███████████████████████████████████████████| 100/100 [1:20:39<00:00, 48.39s/trial, best loss: 0.05966699243049871]
{'bootstrap': 0, 'max_depth': 10.0, 'n_estimators': 3}
= {0:True, 1:False}
btsp = {0:100,1:110,2:120,3:130,4:140}
n_est = RandomForestRegressor(
rf_bo =btsp[best['bootstrap']],
bootstrap= best['max_depth'],
max_depth =n_est[best['n_estimators']]
n_estimators
).fit(X_train, y_train)= rf_bo.predict(X_test)
rf_bo_pred print(f"RMSE: {np.sqrt(mean_squared_error(y_test, rf_bo_pred)):.2f}\nR2 Score: {r2_score(y_test, rf_bo_pred):2f}")
RMSE: 107.42
R2 Score: -0.119587
5. Summary of Findings
- By performing hyperparameter tuning, we have achieved a model that achieves optimal predictions
- Compared to
GridSearchCV
andRandomizedSearchCV
, Bayesian Optimization is a superior tuning approach that produces better results in less time.
6. Recommendations
- More data need to be added. When we have more data, the model will be more accurate and this will minimize bias in prediction.
- More features need to be added. In addition to the data we need more features about rental listings so that the data won’t depend on minimal features during prediction
7. Challenging the Solution
- The question was clear since we were required to perform hyperparameter tuning techniques to your most accurate model in an effort to achieve optimal predictions.
- This was a right data but Major imporvements need to be added to the data, more features need to be added to make the model more accurate. Missing values needs to be addressed during data collection
- To improve the solution, more data is needed to train the model to be as accurate as possible in the prediction.