# loading necessary packages
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
# loading data
data = pd.read_csv('euro_hotels.csv')
data.head()
Which of the variables here are categorical? Which are numerical?
The categorical variables are Gender, purpose_of_travel, Type of Travel, Type of Booking and satisfaction.
The numerical variables are Age, Hotel wifi service, Departure/Arrival convenience, Ease of Online booking, Hotel location, Food and drink, Stay comfort, Common Room entertainment, Checkin/Checkout service, Other service and Cleanliness.
The id variable is also numeric, but it is not a relevant variable for modelling.
data['satisfaction'].value_counts()
Describe your findings - what are the different outcome classes here, and how common are each of them in the dataset?
There are two outcome classes - 'satisfied' and 'neutral or dissatisfied'. The 'neutral or dissatisfied' class is more common than the other class in the dataset: the chance of a traveller being neutral or dissatisfied is 56.67%, while the chance of being satisfied is 43.33%.
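As a quick sanity check on those percentages, the class proportions can be pulled directly with pandas (a minimal sketch reusing the data frame loaded above):
# proportions of each outcome class instead of raw counts
data['satisfaction'].value_counts(normalize=True)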
from sklearn.preprocessing import LabelEncoder
y_encoder = LabelEncoder()
data['satisfaction'] = y_encoder.fit_transform(data['satisfaction'])
data.head()
Comparing the first five rows now, vs. the way they looked when you originally called the head() function, what changed?
The "satisfaction" column values changed. The values earlier were strings whereas after applying label encoder, we have converted the column to have numerical values for the 2 values - satisfied (as 1) and neutral or dissatisfied (as 0).
For your categorical input variables, do you need to take any steps to convert them into dummies, in order to build a logistic regression model? Why or why not?
Yes, the categorical input variables are currently represented as strings, and we need to convert them to a numeric data type before we can pass the data to the model. Dummifying each categorical variable represents it as a set of numeric input variables the model can use.
model_data = pd.get_dummies(data, drop_first=True)
model_data.head()
model_data.columns
# X and y variables
X = model_data.drop(['id','satisfaction'], axis = 1)
y = model_data['satisfaction']
print('Predictor Features Shape:',X.shape)
print('Target Variable Shape:',y.shape)
My lucky number is 9, and I will use it as the random seed to split the dataset into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=9)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# observing model coefficients and intercept
print(log_reg.coef_)
print('\n')
print(log_reg.intercept_)
pd.DataFrame(log_reg.coef_[0], X_train.columns, columns = ['Coefficient'])
Which of your numeric variables appear to influence the outcome variable the most? Which ones have the least impact?
The higher the magnitude of a variable's coefficient, the more influence that variable has on the outcome. The numeric variables that appear to influence the outcome most are Hotel wifi service, Common Room entertainment, Checkin/Checkout service and Stay comfort. The variables with the least impact are Departure/Arrival convenience, Age and Ease of Online booking, as their coefficients are close to 0.
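To make the magnitude comparison explicit, the coefficients can be ranked by absolute value; a small sketch (the coef_df name is just for illustration):
# rank all predictors by the absolute size of their coefficient
coef_df = pd.DataFrame({'Coefficient': log_reg.coef_[0]}, index=X_train.columns)
coef_df.reindex(coef_df['Coefficient'].abs().sort_values(ascending=False).index)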
The 3 chosen variables are Common Room entertainment, Hotel wifi service and Stay comfort. The following details the importance of each chosen variable:
Common Room entertainment: While on a solo trip exploring a new city or country, many travellers enjoy meeting other travellers and bonding with them. A hotel that offers good entertainment services in the common room will attract many travellers to spend time there, and in the process people spend time with one another.
Hotel wifi service: This is very important when someone is travelling for business. A good wifi service ensures that one can work and attend meetings while staying in their hotel room.
Stay comfort: A comfortable stay is what everyone expects when booking a hotel, especially when visiting a place far from home. A comfortable stay ensures that the person is active and energetic throughout the day.
Now look at the categorical variables and their coefficients. Write a paragraph with your opinion about the ‘type of travel’ and ‘purpose of travel’ coefficients shown here.
'Type of travel' impacts the model's outcome significantly, and the coefficient suggests that travellers are less likely to be satisfied when the type of travel is Personal Travel as compared to Group Travel. This is because the coefficient for the Personal Travel dummy is negative.
'purpose of travel' also affects the model's outcome. We have coefficients for 4 purposes of travel - aviation, business, personal and tourism - while the 5th purpose, 'academic', forms the base case. These coefficients are negative, which indicates that people travelling for academic purposes are the most likely to have a satisfied stay at the hotel, and people travelling for aviation purposes are the most likely to have a neutral or dissatisfied stay.
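To look at just these dummy coefficients, the coefficient table can be filtered on the column-name prefixes produced by get_dummies; the prefixes below ('purpose_of_travel' and 'Type of Travel') are assumed to match the original column names:
# rebuild the coefficient table and keep only the travel-related dummies
coef_df = pd.DataFrame({'Coefficient': log_reg.coef_[0]}, index=X_train.columns)
coef_df[coef_df.index.str.contains('purpose_of_travel|Type of Travel')]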
# making predictions on test set
y_pred = log_reg.predict(X_test)
# confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)
print(classification_report(y_test, y_pred))
# accuracy
acc=(conf_matrix[0,0]+conf_matrix[1,1])/np.sum(conf_matrix)
print('Accuracy : ', acc)
# sensitivity
sensitivity = conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,0])
print('Sensitivity : ', sensitivity )
# specificity
specificity = conf_matrix[0,0]/(conf_matrix[0,0]+conf_matrix[0,1])
print('Specificity : ', specificity)
# precision
precision = conf_matrix[1,1]/(conf_matrix[0,1]+conf_matrix[1,1])
print('Precision : ', precision)
# balanced accuracy
bal_acc = (sensitivity+specificity)/2
print('Balanced Accuracy : ', bal_acc)
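As a cross-check on the hand-computed values above, the same quantities can be obtained from sklearn's metric helpers (a sketch; specificity is simply recall with the negative class treated as the positive label):
# sklearn equivalents of the confusion-matrix arithmetic above
from sklearn.metrics import accuracy_score, recall_score, precision_score, balanced_accuracy_score
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Sensitivity :', recall_score(y_test, y_pred))               # recall for class 1
print('Specificity :', recall_score(y_test, y_pred, pos_label=0))  # recall for class 0
print('Precision :', precision_score(y_test, y_pred))
print('Balanced Accuracy :', balanced_accuracy_score(y_test, y_pred))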
What is your model’s accuracy rate?
The model's accuracy rate is 83.95%
What is your model’s sensitivity rate?
The model's sensitivity rate is 81.58%
What is your model’s specificity rate?
The model's specificity rate is 85.76%
What is your model’s precision?
The model's precision is 81.4%
What is your model’s balanced accuracy?
The model’s balanced accuracy is 83.67%
from sklearn.metrics import accuracy_score
train_preds = log_reg.predict(X_train)
print('Training Set Accuracy:', accuracy_score(y_train, train_preds))
print('Test Set Accuracy:', accuracy_score(y_test, y_pred))
What is the purpose of comparing those two values?
The purpose of comparing the two values is to check whether the trained model is overfitting the training dataset. Usually, if a model's accuracy is high on the training set but noticeably lower on the test set, the model is overfitting.
In this case, what does the comparison of those values suggest about the model that you have built?
In our case, the accuracy results on the training dataset and the test dataset are very close. This suggests that our logistic regression model generalizes well and is not overfitting the training dataset.
# made up traveller
traveller_data = [32, 4, 2, 4, 3, 2, 3, 4, 4, 3, 2, 1, 0, 0, 0, 1, 0, 1, 0]
new_traveler = pd.DataFrame([traveller_data], columns = X_train.columns)
new_traveler
# prediction
log_reg.predict(new_traveler)
What did your model predict -- will this person be satisfied?
The model predicts that the person will be satisfied.
# prediction probability
log_reg.predict_proba(new_traveler)
According to your model, what is the probability that the person will be satisfied?
The probability that the person will be satisfied is 0.549
When using a logistic regression model to make predictions, why is it important to only use values within the range of the dataset used to build the model?
It is important to use values within the range of the dataset because a logistic regression model is essentially a regression between the predictor variables and the logit of the outcome. The probability calculation multiplies each input value by its coefficient and sums the results. If an input value is far outside the range the model was trained on, the logit value becomes extreme as well, and the resulting prediction is not reliable. In other words, very high or very low (out-of-range) values start dominating the overall outcome of the model.
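To make that concrete, the model computes the logit z = intercept + sum(coefficient x input value) and then applies the sigmoid 1 / (1 + e^(-z)); a single out-of-range input can push z far from 0 and saturate the probability near 0 or 1 regardless of the other inputs. A small sketch reconstructing the prediction for the made-up traveller above:
# reconstruct predict_proba for the 'satisfied' class by hand
z = np.dot(new_traveler.values[0], log_reg.coef_[0]) + log_reg.intercept_[0]
p_satisfied = 1 / (1 + np.exp(-z))
print(p_satisfied)  # should match log_reg.predict_proba(new_traveler)[0, 1]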
# made up traveller
traveller_1 = [125, 2, 8, 1, 9, 12, 3, 4, 4, 3, 2, 1, 0, 0, 0, 1, 0, 1, 0]
traveller_1 = pd.DataFrame([traveller_1], columns = X_train.columns)
traveller_1
# prediction
log_reg.predict(traveller_1)
# prediction probability
log_reg.predict_proba(traveller_1)
For this made-up traveller, we have put out-of-range values for the Age, Departure/Arrival convenience, Hotel location and Food and drink variables. Three of these variables have a negative impact on the traveller's satisfaction probability, and since we have put in values higher than the normal range, we expect the model to predict dissatisfied with very high probability.
We can see this in the model's probability results as well: the model says the traveller has a 94.25% chance of being neutral or dissatisfied with their stay. The probability of dissatisfaction is very high, which is what we expect given the out-of-range values for the variables mentioned above.
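One way to flag such values programmatically is to compare each input against the minimum and maximum of the training data; this is a rough sketch that treats the training range as the 'normal' range, which is an assumption:
# True marks columns where the made-up value falls outside the training range
out_of_range = (traveller_1.iloc[0] < X_train.min()) | (traveller_1.iloc[0] > X_train.max())
print(out_of_range[out_of_range])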
# loading data
import pandas as pd
data = pd.read_csv('euro_hotels.csv')
data.head()
# encoding the outcome variable again; the data partition below reuses the same seed value - lucky number 9
from sklearn.preprocessing import LabelEncoder
y_encoder = LabelEncoder()
data['satisfaction'] = y_encoder.fit_transform(data['satisfaction'])
data.head()
rf_model_data = pd.get_dummies(data, drop_first=False)
rf_model_data.head()
rf_model_data.columns
# X and y variables
X = rf_model_data.drop(['id','satisfaction'], axis = 1)
y = rf_model_data['satisfaction']
print('Predictor Features Shape:',X.shape)
print('Target Variable Shape:',y.shape)
Using the same seed number, i.e. 9, to split the data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=9)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
rf_classifier = RandomForestClassifier(random_state=9)
# defining parameter grid for grid search
params_grid = {'n_estimators': [100, 500], 'max_features': ['sqrt', 'log2'], 'max_depth' : [3,5,10],
'criterion' :['gini', 'entropy']}
# grid search to fit model
rf_grid_cv = GridSearchCV(estimator=rf_classifier, param_grid=params_grid, cv= 5, n_jobs=-1)
rf_grid_cv.fit(X_train, y_train)
# best accuracy
rf_grid_cv.best_score_
# best parameters
rf_grid_cv.best_params_
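Since GridSearchCV refits the best parameter combination on the full training set by default (refit=True), the tuned model is also available directly as rf_grid_cv.best_estimator_; the explicit refit below simply makes the chosen parameters visible (the best_rf_alt name is just for illustration):
# shortcut equivalent to refitting with the best parameters
best_rf_alt = rf_grid_cv.best_estimator_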
# fitting best model
best_rf = RandomForestClassifier(n_estimators=500, criterion='gini', max_depth=10, max_features='sqrt',
random_state=9)
best_rf.fit(X_train, y_train)
# feature importance plot
import matplotlib.pyplot as plt
import numpy as np
feat_names = X_train.columns
feat_imp = best_rf.feature_importances_
imp_order = np.argsort(feat_imp)
plt.figure(figsize=(12,6))
plt.barh(range(len(imp_order)), feat_imp[imp_order], color='gray', align='center')
plt.title('Feature Importances')
plt.yticks(range(len(imp_order)), [feat_names[i] for i in imp_order])
plt.xlabel('Relative Importance')
plt.show()
# feature importance dataframe
imp_data = pd.DataFrame(columns = ['Variable','Importance'])
imp_data['Variable'] = feat_names
imp_data['Importance'] = feat_imp
imp_data.sort_values(by='Importance', axis = 0, ascending=False).reset_index(drop = True)
For a random forest model, how can you interpret feature importance?
Feature importance describes which features are most relevant for predicting the outcome. The importance calculated by the Random Forest model's built-in attribute is the Gini importance: for each feature, the model measures the decrease in impurity produced by the splits on that feature, and this decrease is averaged across all trees in the forest. That average is the feature's importance score.
Thus, in simple terms, the higher the importance of a feature, the more the model relies on that feature to predict accurately.
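Gini importance can be biased toward features with many distinct values, so as a complementary check the importances can also be estimated by permutation on the held-out test set; a sketch using sklearn's permutation_importance (the n_repeats value here is arbitrary):
# permutation importance: drop in test accuracy when each feature is shuffled
from sklearn.inspection import permutation_importance
perm = permutation_importance(best_rf, X_test, y_test, n_repeats=5, random_state=9, n_jobs=-1)
pd.DataFrame({'Variable': X_test.columns, 'PermImportance': perm.importances_mean}).sort_values(
    by='PermImportance', ascending=False).reset_index(drop=True)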
# making predictions on test set
rf_preds = best_rf.predict(X_test)
# confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
conf_matrix = confusion_matrix(y_test, rf_preds)
print(conf_matrix)
print(classification_report(y_test, rf_preds))
# accuracy
acc=(conf_matrix[0,0]+conf_matrix[1,1])/np.sum(conf_matrix)
print('Accuracy : ', acc)
# sensitivity
sensitivity = conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,0])
print('Sensitivity : ', sensitivity )
# specificity
specificity = conf_matrix[0,0]/(conf_matrix[0,0]+conf_matrix[0,1])
print('Specificity : ', specificity)
# precision
precision = conf_matrix[1,1]/(conf_matrix[0,1]+conf_matrix[1,1])
print('Precision : ', precision)
# balanced accuracy
bal_acc = (sensitivity+specificity)/2
print('Balanced Accuracy : ', bal_acc)
What is your model’s accuracy rate?
The model's accuracy rate is 92.66%
What is your model’s sensitivity rate?
The model's sensitivity rate is 92.46%
What is your model’s specificity rate?
The model's specificity rate is 92.82%
What is your model’s precision?
The model's precision is 90.7%
What is your model’s balanced accuracy?
The model’s balanced accuracy is 92.64%
from sklearn.metrics import accuracy_score
rf_train_preds = best_rf.predict(X_train)
print('Training Set Accuracy:', accuracy_score(y_train, rf_train_preds))
print('Test Set Accuracy:', accuracy_score(y_test, rf_preds))
How different were these results?
The results are comparable, i.e. the accuracy on the training set and the test set are very similar. This suggests that our Random Forest model is not overfitting the training data.
The made-up traveller data needs to be adjusted because we kept all dummy variables (drop_first=False) for the random forest.
# made up traveller
traveller_data_rf = [32, 4, 2, 4, 3, 2, 3, 4, 4, 3, 2, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
new_traveler_rf = pd.DataFrame([traveller_data_rf], columns = X_train.columns)
new_traveler_rf
# prediction
best_rf.predict(new_traveler_rf)
# prediction probability
best_rf.predict_proba(new_traveler_rf)
Does the model think this person will be satisfied?
No, the model does not think the person will be satisfied. The predicted probability of the person being satisfied is 0.3549, and at a threshold of 0.5 the model predicts that the person will not be satisfied.
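The 0.5 cutoff is the default behaviour of predict; the same decision can be reproduced from the probabilities directly, which also makes it easy to experiment with a different cutoff (a small sketch, with 0.5 as the assumed threshold):
# class 1 ('satisfied') is predicted only when its probability clears the threshold
threshold = 0.5
p_satisfied = best_rf.predict_proba(new_traveler_rf)[:, 1]
print((p_satisfied >= threshold).astype(int))  # should match best_rf.predict(new_traveler_rf)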
Write a 3-5 sentence paragraph that speculates about why Lobster Land might care about being able to use this model.
Since Lobster Land is planning to open an on-site hotel on its theme park property, customer satisfaction will be a big factor in growing the hotel business and attracting more people to book with them. Because they are establishing a new hotel and competing with existing options, they need to understand which factors drive customer satisfaction so that their hotel offers the best services to its customers. They can use the model to identify those factors by analyzing its feature importance values. These values suggest that many guests care about a good wifi connection and about common room entertainment services where they can hang out with other travellers. The model also suggests that people visiting through group bookings tend to feel more satisfied with their stay, so Lobster Land could offer special deals on group bookings to attract group travellers and, in the process, achieve a higher customer satisfaction rate.