Assignment III: Classification: Will this Traveler Be Satisfied?

Part I: Logistic Regression Model

Part A
In [1]:
# loading necessary packages
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
In [2]:
# loading data
data = pd.read_csv('euro_hotels.csv')
In [3]:
data.head()
Out[3]:
id Gender Age purpose_of_travel Type of Travel Type Of Booking Hotel wifi service Departure/Arrival convenience Ease of Online booking Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service Other service Cleanliness satisfaction
0 70172 Male 13 aviation Personal Travel Not defined 3 4 3 1 5 5 5 4 5 5 neutral or dissatisfied
1 5047 Male 25 tourism Group Travel Group bookings 3 2 3 3 1 1 1 1 4 1 neutral or dissatisfied
2 110028 Female 26 tourism Group Travel Group bookings 2 2 2 2 5 5 5 4 4 5 satisfied
3 24026 Female 25 tourism Group Travel Group bookings 2 5 5 5 2 2 2 1 4 2 neutral or dissatisfied
4 119299 Male 61 aviation Group Travel Group bookings 3 3 3 3 4 5 3 3 3 3 satisfied
Part B

Which of the variables here are categorical? Which are numerical?

The categorical variables are Gender, purpose_of_travel, Type of Travel, Type Of Booking and satisfaction.

The numerical variables are Age plus the service rating columns: Hotel wifi service, Departure/Arrival convenience, Ease of Online booking, Hotel location, Food and drink, Stay comfort, Common Room entertainment, Checkin/Checkout service, Other service and Cleanliness.

The id variable is also numeric, but it is an identifier rather than a variable relevant for modelling.

Part C
In [4]:
data['satisfaction'].value_counts()
Out[4]:
neutral or dissatisfied    58879
satisfied                  45025
Name: satisfaction, dtype: int64

Describe your findings - what are the different outcome classes here, and how common are each of them in the dataset?

There are two outcome classes: 'satisfied' and 'neutral or dissatisfied'. The 'neutral or dissatisfied' class is more common than the other: of the 103,904 travellers, 58,879 (56.67%) are neutral or dissatisfied, while 45,025 (43.33%) are satisfied.
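The same split can be read off directly as proportions; a quick sketch using value_counts with normalize=True:

# class proportions rather than raw counts
print(data['satisfaction'].value_counts(normalize=True))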

Part D
In [5]:
from sklearn.preprocessing import LabelEncoder
y_encoder = LabelEncoder()
data['satisfaction'] = y_encoder.fit_transform(data['satisfaction'])
In [6]:
data.head()
Out[6]:
id Gender Age purpose_of_travel Type of Travel Type Of Booking Hotel wifi service Departure/Arrival convenience Ease of Online booking Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service Other service Cleanliness satisfaction
0 70172 Male 13 aviation Personal Travel Not defined 3 4 3 1 5 5 5 4 5 5 0
1 5047 Male 25 tourism Group Travel Group bookings 3 2 3 3 1 1 1 1 4 1 0
2 110028 Female 26 tourism Group Travel Group bookings 2 2 2 2 5 5 5 4 4 5 1
3 24026 Female 25 tourism Group Travel Group bookings 2 5 5 5 2 2 2 1 4 2 0
4 119299 Male 61 aviation Group Travel Group bookings 3 3 3 3 4 5 3 3 3 3 1

Comparing the first five rows now, vs. the way they looked when you originally called the head() function, what changed?

The "satisfaction" column values changed. The values earlier were strings whereas after applying label encoder, we have converted the column to have numerical values for the 2 values - satisfied (as 1) and neutral or dissatisfied (as 0).

Part E

For your categorical input variables, do you need to take any steps to convert them into dummies, in order to build a logistic regression model? Why or why not?

Yes. The categorical input variables are currently represented as strings, and scikit-learn's logistic regression requires numeric inputs, so we need to convert each categorical variable into dummy (indicator) variables. With drop_first=True, one level of each variable is dropped to serve as the base case, which avoids redundant, perfectly collinear columns.
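As a minimal illustration of what drop_first does, here is a hypothetical toy column (made up purely for illustration):

# 'blue' comes first alphabetically, so it is dropped and becomes the base case
toy = pd.DataFrame({'color': ['red', 'blue', 'green']})
print(pd.get_dummies(toy, drop_first=True))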

In [7]:
# dummy-encode the categorical variables, dropping one level per variable as the base case
# (only drop_first is non-default; the other get_dummies arguments were left at their defaults)
model_data = pd.get_dummies(data, drop_first=True)

model_data.head()
Out[7]:
id Age Hotel wifi service Departure/Arrival convenience Ease of Online booking Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service ... Cleanliness satisfaction Gender_Male purpose_of_travel_aviation purpose_of_travel_business purpose_of_travel_personal purpose_of_travel_tourism Type of Travel_Personal Travel Type Of Booking_Individual/Couple Type Of Booking_Not defined
0 70172 13 3 4 3 1 5 5 5 4 ... 5 0 1 1 0 0 0 1 0 1
1 5047 25 3 2 3 3 1 1 1 1 ... 1 0 1 0 0 0 1 0 0 0
2 110028 26 2 2 2 2 5 5 5 4 ... 5 1 0 0 0 0 1 0 0 0
3 24026 25 2 5 5 5 2 2 2 1 ... 2 0 0 0 0 0 1 0 0 0
4 119299 61 3 3 3 3 4 5 3 3 ... 3 1 1 1 0 0 0 0 0 0

5 rows × 21 columns

In [8]:
model_data.columns
Out[8]:
Index(['id', 'Age', 'Hotel wifi service', 'Departure/Arrival  convenience',
       'Ease of Online booking', 'Hotel location', 'Food and drink',
       'Stay comfort', 'Common Room entertainment', 'Checkin/Checkout service',
       'Other service', 'Cleanliness', 'satisfaction', 'Gender_Male',
       'purpose_of_travel_aviation', 'purpose_of_travel_business',
       'purpose_of_travel_personal', 'purpose_of_travel_tourism',
       'Type of Travel_Personal Travel', 'Type Of Booking_Individual/Couple',
       'Type Of Booking_Not defined'],
      dtype='object')
In [9]:
# X and y variables
X = model_data.drop(['id','satisfaction'], axis = 1)
y = model_data['satisfaction']
In [10]:
print('Predictor Features Shape:',X.shape)
print('Target Variable Shape:',y.shape)
Predictor Features Shape: (103904, 19)
Target Variable Shape: (103904,)
Part F

My lucky number is 9, so I will use it as the random seed to split the dataset into training and testing sets.

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=9)
In [12]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(62342, 19)
(62342,)
(41562, 19)
(41562,)
Part G
In [13]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
Out[13]:
LogisticRegression()
Part H
In [14]:
# observing model coefficients and intercept
print(log_reg.coef_)
print('\n')
print(log_reg.intercept_)
[[ 0.01140767  0.59735789 -0.0090887  -0.0663425  -0.20372393 -0.17828225
   0.24762543  0.48228278  0.33702107  0.12484225  0.11425301 -0.01303704
  -0.29037067 -0.22250768 -0.26157978 -0.22189142 -1.75129811 -1.62300289
  -1.99981392]]


[-4.07080333]
In [15]:
pd.DataFrame(log_reg.coef_[0], X_train.columns, columns = ['Coefficient'])
Out[15]:
Coefficient
Age 0.011408
Hotel wifi service 0.597358
Departure/Arrival convenience -0.009089
Ease of Online booking -0.066343
Hotel location -0.203724
Food and drink -0.178282
Stay comfort 0.247625
Common Room entertainment 0.482283
Checkin/Checkout service 0.337021
Other service 0.124842
Cleanliness 0.114253
Gender_Male -0.013037
purpose_of_travel_aviation -0.290371
purpose_of_travel_business -0.222508
purpose_of_travel_personal -0.261580
purpose_of_travel_tourism -0.221891
Type of Travel_Personal Travel -1.751298
Type Of Booking_Individual/Couple -1.623003
Type Of Booking_Not defined -1.999814

Which of your numeric variables appear to influence the outcome variable the most? Which ones have the least impact?

The higher the magnitude of a variable's coefficient, the more influence that variable has on the outcome. The numeric variables that appear to influence the outcome most are Hotel wifi service, Common Room entertainment, Checkin/Checkout service and Stay comfort. The variables with the least impact are Departure/Arrival convenience, Age and Ease of Online booking, as their coefficients are closest to 0.
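To rank the drivers explicitly, the coefficient table from above can be sorted by absolute value; a sketch (note this treats the rating inputs as sharing a comparable scale, since raw coefficients on differently scaled inputs are not directly comparable):

coef_df = pd.DataFrame(log_reg.coef_[0], X_train.columns, columns=['Coefficient'])
print(coef_df.reindex(coef_df['Coefficient'].abs().sort_values(ascending=False).index))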

The 3 chosen variables are Common Room entertainment, Hotel wifi service and Stay comfort. The following details the importance of each chosen variable:

  1. Common Room entertainment: Travellers on a solo trip exploring a new city or country enjoy meeting other travellers and bonding with them. A hotel that offers good entertainment services in the common room will attract many travellers to spend time there, and in the process with one another.

  2. Hotel wifi service: This is very important for business travel. A good wifi service ensures that guests can work and attend meetings from their hotel rooms.

  3. Stay comfort: A comfortable stay is what everyone expects when booking a hotel, especially when visiting a place far from home. A comfortable stay keeps the guest active and energetic throughout the day.

Now look at the categorical variables and their coefficients. Write a paragraph with your opinion about the ‘type of travel’ and ‘purpose of travel’ coefficients shown here.

'Type of Travel' impacts the model's outcome significantly. The coefficient for Personal Travel is negative, which suggests that travellers are less likely to be satisfied when the type of travel is Personal Travel than when it is Group Travel (the base case).

'purpose_of_travel' also affects the model's outcome. We have coefficients for 4 travel purposes: aviation, business, personal and tourism; the 5th purpose, 'academic', forms the base case. All four coefficients are negative, which indicates that, all else equal, people travelling for academic purposes are the most likely to be satisfied with their hotel stay, while people travelling for aviation purposes (the most negative coefficient) are the most likely to be neutral or dissatisfied.
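One way to make these categorical coefficients more concrete is to exponentiate them into odds ratios (values below 1 reduce the odds of satisfaction relative to the base category); a minimal sketch:

odds_ratios = pd.DataFrame(np.exp(log_reg.coef_[0]), X_train.columns, columns=['Odds Ratio'])
print(odds_ratios.loc[['Type of Travel_Personal Travel', 'purpose_of_travel_aviation']])

For example, exp(-1.7513) ≈ 0.17, so personal travel multiplies the odds of satisfaction by roughly 0.17 compared with group travel, all else held constant.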

Part I
In [17]:
# making predictions on test set
y_pred = log_reg.predict(X_test)
In [18]:
# confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)
[[20232  3359]
 [ 3311 14660]]
In [19]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.86      0.86      0.86     23591
           1       0.81      0.82      0.81     17971

    accuracy                           0.84     41562
   macro avg       0.84      0.84      0.84     41562
weighted avg       0.84      0.84      0.84     41562

In [20]:
# accuracy
acc=(conf_matrix[0,0]+conf_matrix[1,1])/np.sum(conf_matrix)
print ('Accuracy : ', acc)
# sensitivity
sensitivity = conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,0])
print('Sensitivity : ', sensitivity )
# specificity
specificity = conf_matrix[0,0]/(conf_matrix[0,0]+conf_matrix[0,1])
print('Specificity : ', specificity)
# precision
precision = conf_matrix[1,1]/(conf_matrix[0,1]+conf_matrix[1,1])
print('Precision : ', precision)
# balanced accuracy
bal_acc = (sensitivity+specificity)/2
print('Balanced Accuracy : ', bal_acc)
Accuracy :  0.8395168663683172
Sensitivity :  0.8157587223860664
Specificity :  0.8576151922343267
Precision :  0.8135856595815528
Balanced Accuracy :  0.8366869573101965

What is your model’s accuracy rate?
The model's accuracy rate is 83.95%

What is your model’s sensitivity rate?
The model's sensitivity rate is 81.58%

What is your model’s specificity rate?
The model's specificity rate is 85.76%

What is your model’s precision?
The model's precision is 81.36%

What is your model’s balanced accuracy?
The model’s balanced accuracy is 83.67%
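These manual calculations can be cross-checked against sklearn's built-in metric functions (a sketch; recall of class 1 is sensitivity, recall of class 0 is specificity):

from sklearn.metrics import recall_score, precision_score, balanced_accuracy_score
print('Sensitivity :', recall_score(y_test, y_pred))
print('Specificity :', recall_score(y_test, y_pred, pos_label=0))
print('Precision   :', precision_score(y_test, y_pred))
print('Balanced Acc:', balanced_accuracy_score(y_test, y_pred))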

Part J
In [21]:
from sklearn.metrics import accuracy_score
train_preds = log_reg.predict(X_train)

print('Training Set Accuracy:', accuracy_score(y_train, train_preds))
print('Test Set Accuracy:', accuracy_score(y_test, y_pred))
Training Set Accuracy: 0.8408456578229765
Test Set Accuracy: 0.8395168663683172

What is the purpose of comparing those two values?

The purpose of comparing the two values is to check whether the trained model is overfitting the training dataset. Typically, if accuracy is high on the training set but noticeably lower on the test set, the model is overfitting.

In this case, what does the comparison of those values suggest about the model that you have built?

In our case, the accuracy results on the training and test datasets are very close. This suggests that our logistic regression model generalizes well and does not suffer from overfitting on the training dataset.
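A related check, sketched below, is k-fold cross-validation, which estimates accuracy without depending on a single train/test split:

from sklearn.model_selection import cross_val_score
print(cross_val_score(LogisticRegression(), X, y, cv=5).mean())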

Part K
In [22]:
# made up traveller
traveller_data = [32, 4, 2, 4, 3, 2, 3, 4, 4, 3, 2, 1, 0, 0, 0, 1, 0, 1, 0]
new_traveler = pd.DataFrame([traveller_data], columns = X_train.columns)
new_traveler
Out[22]:
Age Hotel wifi service Departure/Arrival convenience Ease of Online booking Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service Other service Cleanliness Gender_Male purpose_of_travel_aviation purpose_of_travel_business purpose_of_travel_personal purpose_of_travel_tourism Type of Travel_Personal Travel Type Of Booking_Individual/Couple Type Of Booking_Not defined
0 32 4 2 4 3 2 3 4 4 3 2 1 0 0 0 1 0 1 0
In [23]:
# prediction
log_reg.predict(new_traveler)
Out[23]:
array([1])

What did your model predict -- will this person be satisfied?

The model predicts that the person will be satisfied.

In [24]:
# prediction probability
log_reg.predict_proba(new_traveler)
Out[24]:
array([[0.45076432, 0.54923568]])

According to your model, what is the probability that the person will be satisfied?

The probability that the person will be satisfied is 0.549

When using a logistic regression model to make predictions, why is it important to only use values within the range of the dataset used to build the model?

It is important to use values within the range of the training dataset because logistic regression fits a linear relationship between the predictor variables and the logit (log-odds) of the outcome. The probability calculation multiplies each input value by its coefficient, so an out-of-range value produces an extreme logit that the model never saw during training, and the resulting probability is not reliable. In other words, very high or very low (out-of-range) values start dominating the overall outcome, and the model ends up extrapolating a linear relationship beyond the region where it was estimated.
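As a concrete illustration of that calculation, the following sketch reproduces predict_proba for the made-up traveller by hand (assuming log_reg and new_traveler from the cells above):

# linear score (logit), then the sigmoid transform
logit = log_reg.intercept_[0] + np.dot(new_traveler.values[0], log_reg.coef_[0])
print(1 / (1 + np.exp(-logit)))  # ≈ 0.549, matching predict_proba(...)[0, 1]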

Part L
In [25]:
# made up traveller
traveller_1 = [125, 2, 8, 1, 9, 12, 3, 4, 4, 3, 2, 1, 0, 0, 0, 1, 0, 1, 0]
traveller_1 = pd.DataFrame([traveller_1], columns = X_train.columns)
traveller_1
Out[25]:
Age Hotel wifi service Departure/Arrival convenience Ease of Online booking Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service Other service Cleanliness Gender_Male purpose_of_travel_aviation purpose_of_travel_business purpose_of_travel_personal purpose_of_travel_tourism Type of Travel_Personal Travel Type Of Booking_Individual/Couple Type Of Booking_Not defined
0 125 2 8 1 9 12 3 4 4 3 2 1 0 0 0 1 0 1 0
In [26]:
# prediction
log_reg.predict(traveller_1)
Out[26]:
array([0])
In [27]:
# prediction probability
log_reg.predict_proba(traveller_1)
Out[27]:
array([[0.94250755, 0.05749245]])

For this made-up traveller, we have put out-of-range values into the Age, Departure/Arrival convenience, Hotel location and Food and drink variables. Three of these variables have negative coefficients, so they lower the satisfaction probability as they increase; since the values are well above the normal range, we expect the model to predict 'neutral or dissatisfied' with very high probability.

The model's probability output confirms this: it gives the traveller a 94.25% chance of being neutral or dissatisfied with their stay. The probability of the dissatisfied class is very high, which is exactly what we expect after putting out-of-range high values into the variables mentioned above.

Part II: Random Forest Model

Part M
In [28]:
# loading data
import pandas as pd
data = pd.read_csv('euro_hotels.csv')
data.head()
Out[28]:
id Gender Age purpose_of_travel Type of Travel Type Of Booking Hotel wifi service Departure/Arrival convenience Ease of Online booking Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service Other service Cleanliness satisfaction
0 70172 Male 13 aviation Personal Travel Not defined 3 4 3 1 5 5 5 4 5 5 neutral or dissatisfied
1 5047 Male 25 tourism Group Travel Group bookings 3 2 3 3 1 1 1 1 4 1 neutral or dissatisfied
2 110028 Female 26 tourism Group Travel Group bookings 2 2 2 2 5 5 5 4 4 5 satisfied
3 24026 Female 25 tourism Group Travel Group bookings 2 5 5 5 2 2 2 1 4 2 neutral or dissatisfied
4 119299 Male 61 aviation Group Travel Group bookings 3 3 3 3 4 5 3 3 3 3 satisfied
In [29]:
# encoding the target variable again, as in Part D
from sklearn.preprocessing import LabelEncoder
y_encoder = LabelEncoder()
data['satisfaction'] = y_encoder.fit_transform(data['satisfaction'])
In [30]:
data.head()
Out[30]:
id Gender Age purpose_of_travel Type of Travel Type Of Booking Hotel wifi service Departure/Arrival convenience Ease of Online booking Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service Other service Cleanliness satisfaction
0 70172 Male 13 aviation Personal Travel Not defined 3 4 3 1 5 5 5 4 5 5 0
1 5047 Male 25 tourism Group Travel Group bookings 3 2 3 3 1 1 1 1 4 1 0
2 110028 Female 26 tourism Group Travel Group bookings 2 2 2 2 5 5 5 4 4 5 1
3 24026 Female 25 tourism Group Travel Group bookings 2 5 5 5 2 2 2 1 4 2 0
4 119299 Male 61 aviation Group Travel Group bookings 3 3 3 3 4 5 3 3 3 3 1
In [31]:
# for the random forest, keep all dummy levels (drop_first=False, the default)
rf_model_data = pd.get_dummies(data, drop_first=False)

rf_model_data.head()
Out[31]:
id Age Hotel wifi service Departure/Arrival convenience Ease of Online booking Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service ... purpose_of_travel_academic purpose_of_travel_aviation purpose_of_travel_business purpose_of_travel_personal purpose_of_travel_tourism Type of Travel_Group Travel Type of Travel_Personal Travel Type Of Booking_Group bookings Type Of Booking_Individual/Couple Type Of Booking_Not defined
0 70172 13 3 4 3 1 5 5 5 4 ... 0 1 0 0 0 0 1 0 0 1
1 5047 25 3 2 3 3 1 1 1 1 ... 0 0 0 0 1 1 0 1 0 0
2 110028 26 2 2 2 2 5 5 5 4 ... 0 0 0 0 1 1 0 1 0 0
3 24026 25 2 5 5 5 2 2 2 1 ... 0 0 0 0 1 1 0 1 0 0
4 119299 61 3 3 3 3 4 5 3 3 ... 0 1 0 0 0 1 0 1 0 0

5 rows × 25 columns

In [32]:
rf_model_data.columns
Out[32]:
Index(['id', 'Age', 'Hotel wifi service', 'Departure/Arrival  convenience',
       'Ease of Online booking', 'Hotel location', 'Food and drink',
       'Stay comfort', 'Common Room entertainment', 'Checkin/Checkout service',
       'Other service', 'Cleanliness', 'satisfaction', 'Gender_Female',
       'Gender_Male', 'purpose_of_travel_academic',
       'purpose_of_travel_aviation', 'purpose_of_travel_business',
       'purpose_of_travel_personal', 'purpose_of_travel_tourism',
       'Type of Travel_Group Travel', 'Type of Travel_Personal Travel',
       'Type Of Booking_Group bookings', 'Type Of Booking_Individual/Couple',
       'Type Of Booking_Not defined'],
      dtype='object')
In [33]:
# X and y variables
X = rf_model_data.drop(['id','satisfaction'], axis = 1)
y = rf_model_data['satisfaction']

print('Predictor Features Shape:',X.shape)
print('Target Variable Shape:',y.shape)
Predictor Features Shape: (103904, 23)
Target Variable Shape: (103904,)
Part N

Using the same seed number, i.e. 9, to split the data.

In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=9)
Part O
In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

rf_classifier = RandomForestClassifier(random_state=9)

# defining parameter grid for grid search
params_grid = {'n_estimators': [100, 500], 'max_features': ['sqrt', 'log2'], 'max_depth' : [3,5,10], 
              'criterion' :['gini', 'entropy']}

# grid search to fit model
rf_grid_cv = GridSearchCV(estimator=rf_classifier, param_grid=params_grid, cv= 5, n_jobs=-1)
rf_grid_cv.fit(X_train, y_train)
Out[35]:
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=9), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 5, 10],
                         'max_features': ['sqrt', 'log2'],
                         'n_estimators': [100, 500]})
In [36]:
# best accuracy
rf_grid_cv.best_score_
Out[36]:
0.9273684319402783
In [37]:
# best parameters
rf_grid_cv.best_params_
Out[37]:
{'criterion': 'gini',
 'max_depth': 10,
 'max_features': 'sqrt',
 'n_estimators': 500}
In [38]:
# fitting best model
best_rf = RandomForestClassifier(n_estimators=500, criterion='gini', max_depth=10, max_features='sqrt', 
                                 random_state=9)
best_rf.fit(X_train, y_train)
Out[38]:
RandomForestClassifier(max_depth=10, max_features='sqrt', n_estimators=500,
                       random_state=9)
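Note that GridSearchCV uses refit=True by default, so the best configuration has already been retrained on the full training set; the manual refit above could equivalently be written as:

best_rf = rf_grid_cv.best_estimator_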
Part P
In [39]:
# feature importance plot
import matplotlib.pyplot as plt
import numpy as np

feat_names = X_train.columns
feat_imp = best_rf.feature_importances_
imp_order = np.argsort(feat_imp)

plt.figure(figsize=(12,6))
plt.barh(range(len(imp_order)), feat_imp[imp_order], color='gray', align='center')
plt.title('Feature Importances')
plt.yticks(range(len(imp_order)), [feat_names[i] for i in imp_order])
plt.xlabel('Relative Importance')
plt.show()
[Figure: horizontal bar chart of relative feature importances, one bar per feature]
In [40]:
# feature importance dataframe
imp_data = pd.DataFrame(columns = ['Variable','Importance'])
imp_data['Variable'] = feat_names
imp_data['Importance'] = feat_imp
imp_data.sort_values(by='Importance', axis = 0, ascending=False).reset_index(drop = True)
Out[40]:
Variable Importance
0 Hotel wifi service 0.201018
1 Type Of Booking_Group bookings 0.116467
2 Common Room entertainment 0.103294
3 Type Of Booking_Individual/Couple 0.088914
4 Type of Travel_Personal Travel 0.084837
5 Type of Travel_Group Travel 0.083725
6 Stay comfort 0.076684
7 Ease of Online booking 0.052779
8 Cleanliness 0.043228
9 Other service 0.042396
10 Checkin/Checkout service 0.030547
11 Age 0.022305
12 Hotel location 0.016778
13 Departure/Arrival convenience 0.014857
14 Food and drink 0.013654
15 Type Of Booking_Not defined 0.004166
16 Gender_Female 0.000924
17 Gender_Male 0.000890
18 purpose_of_travel_tourism 0.000540
19 purpose_of_travel_academic 0.000536
20 purpose_of_travel_aviation 0.000520
21 purpose_of_travel_business 0.000520
22 purpose_of_travel_personal 0.000422

For a random forest model, how can you interpret feature importance?

Feature importance describes which features are most relevant to the model's predictions of the outcome. The importance computed by the Random Forest model's built-in method is Gini importance: at every split that uses a given feature, the model measures the decrease in node impurity, and these decreases are accumulated for that feature within each tree and then averaged across all trees in the forest. That average is the feature's importance score.

Thus, in simple terms, the higher a feature's importance score, the more that feature contributes to the model's ability to predict accurately.
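As a complementary check, permutation importance measures how much held-out accuracy drops when a feature's values are shuffled, which avoids the bias Gini importance can show toward features with many distinct values. A minimal sketch, assuming best_rf, X_test and y_test from above:

from sklearn.inspection import permutation_importance
perm = permutation_importance(best_rf, X_test, y_test, n_repeats=5, random_state=9)
print(pd.Series(perm.importances_mean, index=X_test.columns).sort_values(ascending=False).head())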

Part Q
In [41]:
# making predictions on test set
rf_preds = best_rf.predict(X_test)
In [42]:
# confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
conf_matrix = confusion_matrix(y_test, rf_preds)
print(conf_matrix)
[[21896  1695]
 [ 1355 16616]]
In [43]:
print(classification_report(y_test, rf_preds))
              precision    recall  f1-score   support

           0       0.94      0.93      0.93     23591
           1       0.91      0.92      0.92     17971

    accuracy                           0.93     41562
   macro avg       0.92      0.93      0.93     41562
weighted avg       0.93      0.93      0.93     41562

In [44]:
# accuracy
acc=(conf_matrix[0,0]+conf_matrix[1,1])/np.sum(conf_matrix)
print ('Accuracy : ', acc)
# sensitivity
sensitivity = conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,0])
print('Sensitivity : ', sensitivity )
# specificity
specificity = conf_matrix[0,0]/(conf_matrix[0,0]+conf_matrix[0,1])
print('Specificity : ', specificity)
# precision
precision = conf_matrix[1,1]/(conf_matrix[0,1]+conf_matrix[1,1])
print('Precision : ', precision)
# balanced accuracy
bal_acc = (sensitivity+specificity)/2
print('Balanced Accuracy : ', bal_acc)
Accuracy :  0.926615658534238
Sensitivity :  0.9246007456457627
Specificity :  0.9281505658937731
Precision :  0.9074326907323467
Balanced Accuracy :  0.9263756557697679

What is your model’s accuracy rate?
The model's accuracy rate is 92.66%

What is your model’s sensitivity rate?
The model's sensitivity rate is 92.46%

What is your model’s specificity rate?
The model's specificity rate is 92.82%

What is your model’s precision?
The model's precision is 90.74%

What is your model’s balanced accuracy?
The model’s balanced accuracy is 92.64%

Part R
In [45]:
from sklearn.metrics import accuracy_score
rf_train_preds = best_rf.predict(X_train)

print('Training Set Accuracy:', accuracy_score(y_train, rf_train_preds))
print('Test Set Accuracy:', accuracy_score(y_test, rf_preds))
Training Set Accuracy: 0.9310416733502294
Test Set Accuracy: 0.926615658534238

How different were these results?

The results are comparable, i.e. the accuracy on the training set and the test set are very similar. This suggests that our Random Forest model is not overfitting the training data.

The made-up traveller data needs to be adjusted, because we kept all dummy levels (no base case was dropped) for the random forest.

Part S
In [46]:
# made up traveller
traveller_data_rf = [32, 4, 2, 4, 3, 2, 3, 4, 4, 3, 2, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
new_traveler_rf = pd.DataFrame([traveller_data_rf], columns = X_train.columns)
new_traveler_rf
Out[46]:
Age Hotel wifi service Departure/Arrival convenience Ease of Online booking Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service Other service ... purpose_of_travel_academic purpose_of_travel_aviation purpose_of_travel_business purpose_of_travel_personal purpose_of_travel_tourism Type of Travel_Group Travel Type of Travel_Personal Travel Type Of Booking_Group bookings Type Of Booking_Individual/Couple Type Of Booking_Not defined
0 32 4 2 4 3 2 3 4 4 3 ... 0 0 0 0 1 1 0 0 1 0

1 rows × 23 columns

In [47]:
# prediction
best_rf.predict(new_traveler_rf)
Out[47]:
array([0])
In [48]:
# prediction probability
best_rf.predict_proba(new_traveler_rf)
Out[48]:
array([[0.64505157, 0.35494843]])

Does the model think this person will be satisfied?

No, the model does not think that the person will be satisfied. The probability of the person being satisfied is 0.3549, and at a threshold of 0.5 the model predicts that the person will not be satisfied.
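Under the hood, predict() applies a 0.5 cutoff to the class-1 probability; the sketch below makes that threshold explicit (and shows where a different operating point could be substituted):

prob_satisfied = best_rf.predict_proba(new_traveler_rf)[:, 1]
print((prob_satisfied >= 0.5).astype(int))  # array([0]), matching best_rf.predict(...)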

Part T

Write a 3-5 sentence paragraph that speculates about why Lobster Land might care about being able to use this model.

Since Lobster Land is planning to open an on-site hotel on its theme park property, customer satisfaction will be a big factor in growing the hotel's business and attracting more guests. Because Lobster Land is establishing a new hotel and competing with existing options, it needs to understand which factors drive customer satisfaction so that its hotel offers the services guests value most. The model's feature importance values point to those factors: they suggest that many guests care about a good wifi connection and about common room entertainment where they can hang out with other travellers. The model also suggests that people visiting through group bookings tend to be more satisfied with their stay, so Lobster Land could create special offers on group bookings to attract group travellers and, in the process, achieve a higher customer satisfaction rate.