For this assignment, you will use the dataset shared with you in the "Data" worksheet of the "Churn-Dataset_DSB.xlsx" file.

The dataset consists of 20,000 examples (lines, rows) over 12 variables (fields, columns). The dataset constitutes a two-class supervised learning problem. The class variable, CHURN?, is the last variable on each line, and its legal values are LEAVE and STAY. Informally, here are the meanings of each variable in the dataset:

• COLLEGE : Is the customer college educated?

• INCOME : Annual income

• OVERAGE : Average overcharges per month

• LEFTOVER : Average % leftover minutes per month

• HOUSE : Value of dwelling (from census tract)

• HANDSET_PRICE : Cost of phone

• LONG_CALLS_PER_MONTH : Average number of long (>15 mins) calls per month

• AVERAGE_CALL_DURATION : Average call duration

• REPORTED_SATISFACTION : Reported level of satisfaction

• REPORTED_USAGE_LEVEL : Self-reported usage level

• CONSIDERING_CHANGE_OF_PLAN : Was the customer considering changing their plan?

• CHURN? : Target variable: whether customer left or stayed
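As a starting point, the target can be separated from the predictors and encoded as 0/1, which keeps scikit-learn's default binary precision and recall scorers working. This is a minimal sketch, not a required approach: the helper name `split_features_target`, the choice of LEAVE as the positive class (1), and the three demo rows are all assumptions made for illustration. The real data would be read with something like `pd.read_excel("Churn-Dataset_DSB.xlsx", sheet_name="Data")`.

```python
import pandas as pd


def split_features_target(df, target="CHURN?"):
    """Split a churn DataFrame into predictors X and a 0/1 target y.

    Assumption: LEAVE is treated as the positive class (1), STAY as 0.
    """
    X = df.drop(columns=[target])
    y = df[target].map({"LEAVE": 1, "STAY": 0})
    if y.isna().any():
        # Any value other than the two legal class labels is an error.
        raise ValueError("Unexpected value in target column")
    return X, y.astype(int)


# Made-up rows for illustration only; these are not taken from the real dataset.
demo = pd.DataFrame(
    {
        "INCOME": [30000, 85000, 52000],
        "OVERAGE": [120, 0, 45],
        "CHURN?": ["LEAVE", "STAY", "LEAVE"],
    }
)

X_demo, y_demo = split_features_target(demo)
print(X_demo.columns.tolist())
print(y_demo.tolist())
```

If some predictor columns turn out to hold text categories, they would also need encoding (for example with `pd.get_dummies`) before being passed to the classifiers.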

Analyze this dataset using scikit-learn and related libraries (e.g., Pandas, Seaborn, Matplotlib). Following the analysis template provided in the file "Example-California-Housing-Prices - Modeling.ipynb", create a classification model using similar data mining steps.

• Create a 20% hold out data set that should be used at the very end on the selected model after validation. Use a random seed of 42 for partitioning the data.

• Validate your models on the remaining 80% of the data using 10-fold cross-validation.

• Develop the following models of different types:

o Decision Tree Classifier

o Naïve Bayes Classifier

o Logistic Regression Classifier

o Linear SVC Classifier

o Kernelized Support Vector Machine (SVM) Classifier

o Random Forest Classifier

• You can use appropriate model parameters. For each model, feel free to alter the model parameters and test your model.

• Report the accuracy, precision and recall for each model. (Hint: Use sklearn.model_selection.cross_validate instead of sklearn.model_selection.cross_val_score so that you can retrieve multiple classification metrics of accuracy, precision, and recall. See sklearn.metrics to see available metrics.)

• Select the best model based on 10-fold cross-validation. (Optional - If possible, try to find the top predictors for your model. This will be possible for some but not all models.)

• Apply the best model on the 20% hold out data set and report the three metrics of accuracy, precision, and recall.

• IMPORTANT: Provide comments liberally throughout your Jupyter notebook.
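The steps in the list above can be sketched end to end as follows. This is one possible outline under stated assumptions, not a reference solution: synthetic data from `make_classification` stands in for the encoded churn features, the hyperparameters are arbitrary starting values, the scaling pipelines are an added choice, and selecting the best model by mean accuracy is only one reasonable criterion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): replace with the encoded churn features and 0/1 target.
X, y = make_classification(n_samples=500, n_features=11, random_state=42)

# 20% hold-out set, untouched until the final evaluation; seed 42 as required.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The six model types; scaling is added for the linear and SVM models (a design choice).
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Linear SVC": make_pipeline(StandardScaler(), LinearSVC(max_iter=5000)),
    "Kernel SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 10-fold cross-validation on the 80% training portion, three metrics at once.
scoring = ["accuracy", "precision", "recall"]
cv_results = {}
for name, model in models.items():
    scores = cross_validate(model, X_train, y_train, cv=10, scoring=scoring)
    cv_results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in cv_results[name].items()})

# Pick the best model by mean cross-validated accuracy (assumed selection criterion).
best_name = max(cv_results, key=lambda n: cv_results[n]["accuracy"])
best_model = models[best_name].fit(X_train, y_train)

# Final, one-time evaluation on the 20% hold-out set.
y_pred = best_model.predict(X_test)
holdout = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}
print("Best model:", best_name, holdout)
```

For the optional top-predictors step, tree-based models expose `feature_importances_` and linear models expose `coef_` after fitting; the kernelized SVM and Naive Bayes offer no direct equivalent.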
