University of Westminster School of Computer Science
Coursework Description
The Real-world Problem Description
A) The Domain:
The machine learning modelling in this coursework aims to tackle a long-term real-world disease burden. Obesity affects an increasing number of adults in the UK, and obesity-associated changes in adipose tissue (Figure 1), in particular the accumulation of visceral fat (VF), are a critical factor in determining susceptibility to diseases such as diabetes, cancers, and cardiovascular diseases. Certain levels of VF accumulation around organs increase a person's risk of developing chronic illnesses.
Fig. 1: Illustration of visceral fat (VF)
If individuals with excessive amounts of visceral fat are identified, suitable diet, exercise, and medical interventions can be recommended or prescribed to reduce such fat, lowering their risk of developing a chronic disease.
B) The Current Process of identifying VF amounts:
Visceral fat (VF) amounts in the human body are measured in litres. The method currently used to measure the amount of VF within the body is to view the inside of the body via a Magnetic Resonance Imaging (MRI) scan (see Figure 2). The scan produces multiple images containing regions of interest, which are examined by a radiologist to indicate the location and amounts of visceral fat (VF) in the body; see Figure 3.
C) The Domain Problem:
The MRI scan cannot be done for everyone. The MRI process is expensive: without the NHS and taxpayers' money, each scan costs an average of £450 to £600. It is also time-consuming; a single scan can take up to 90 minutes to complete, and because the images must be examined by a radiologist to distinguish subjects with dangerous VF amounts, results can take a couple of weeks to arrive. Due to cost, it is not feasible to perform an MRI on everybody to establish who is at risk of developing long-term critical illnesses. Moreover, VF cannot be judged by the naked eye: some people look thin (slim) from the outside but could still carry dangerous levels of VF inside (Thin Outside, Fat Inside), while some athletes, like sumo wrestlers, look thick (obese) on the outside yet have low VF amounts (Fat Outside, Thin Inside). See Figure 4.
From research, depending on VF amounts in litres, subjects' risk of developing diseases can be classified as Low Risk, Moderate Risk, or High Risk. Unlike Low-Risk subjects, Moderate- and High-Risk subjects require changes (interventions) to their lifestyle and diet. Unlike Moderate-Risk subjects, High-Risk subjects require urgent, or even immediate, changes to lifestyle and diet.
An additional domain challenge is linked to the physiological differences between males and females: dangerous VF levels in females differ from those in males. See Table 1 below.
Table 1: VF risk thresholds by gender

Gender  | VF Levels     | Risk Level    | Require Intervention (Y/N)? | Intervention Urgency
Females | VF ≤ 2 L      | Low Risk      | No                          | None
Females | 2 < VF ≤ 5 L  | Moderate Risk | Yes                         | Medium
Females | VF > 5 L      | High Risk     | Yes                         | High
Males   | VF ≤ 3 L      | Low Risk      | No                          | None
Males   | 3 < VF ≤ 6 L  | Moderate Risk | Yes                         | Medium
Males   | VF > 6 L      | High Risk     | Yes                         | High
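The thresholds in Table 1 can be expressed as a simple lookup. A minimal sketch (the function and argument names are illustrative, not part of the coursework material):

```python
def vf_risk_level(sex: str, vf_litres: float) -> str:
    """Map a subject's sex ('F' or 'M') and visceral fat volume in litres
    to the risk bands of Table 1."""
    # Female cut-offs are 2 L and 5 L; male cut-offs are 3 L and 6 L.
    low_cap, moderate_cap = (2.0, 5.0) if sex == "F" else (3.0, 6.0)
    if vf_litres <= low_cap:
        return "Low Risk"
    if vf_litres <= moderate_cap:
        return "Moderate Risk"
    return "High Risk"
```

Note how the same VF value (e.g. 2.5 L) falls into different bands depending on gender, which is why the two cohorts must be modelled separately.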
D) Your Role as A Data Scientist:
You are hired as a data scientist to work alongside a team of healthcare professionals to build machine learning models to tackle the above problem of visceral fat obesity. The healthcare professionals have two historical datasets, one for males and the other for females, of subjects who underwent MRI in the past and had their VF levels recorded in litres.
The Healthcare professionals are relying on your help to answer only one of the following three questions on just one of the two datasets:
Does machine learning have the potential to predict the risk levels (Low, Moderate, High) of Visceral Fat for new future subjects based on historical Visceral Fat data without the need for MRI scans in the future?
Does machine learning have the potential to predict new future subjects who need lifestyle and diet interventions (Intervention Required vs Not Required) based on the historical Visceral Fat data without the need for MRI scans in the future?
Does machine learning have the potential to predict the urgency of intervention for subjects who need lifestyle and diet changes (None, Medium, High) based on their Visceral Fat without the need to undergo MRI scans in the future?
E) Your Dataset
You are given two datasets, one for females and the other for males. As mentioned, due to physiological differences, the two cohorts must not be mixed; they are analysed independently. Each dataset contains the following attributes:
Attribute Name              | Attribute Description                                        | Unit
SUBJECT_ID                  | Unique alphanumeric identification number per subject        | None
SEX                         | Subject's gender: M = Male, F = Female                       | None
AGE                         | Subject's age                                                | Years
BMI                         | Body Mass Index, calculated as Weight (kg) / [Height (m)]²   | kg/m²
HEIGHT                      | Subject's measured height on the scan day                    | cm
WEIGHT                      | Subject's measured weight on the scan day                    | kg
WAIST_CIRCUMFERENCE         | Length measurement around the middle of the body             | cm
DIASTOLIC_BLOOD_PRESSURE    | Pressure in the arteries when the heart rests between beats  | mmHg
SYSTOLIC_BLOOD_PRESSURE     | Pressure in the arteries when the heart beats                | mmHg
WALK_DURATION_PER_DAY       | Duration of subject's walking in a single 24 hours (one day) | Minutes
COMPUTER_USE_TIME_PER_DAY   | Duration of subject's PC use in a single 24 hours (one day)  | Hours
SMOKING_STATUS              | 1 = Smoker, 0 = Non-smoker                                   | None
DISCONTINUED_NO             | Unknown                                                      | None
CIGARETTES_CONSUMED_PER_DAY | Count of cigarettes smoked in a single 24 hours (one day)    | Count
Visceral_Fat_Volume         | Visceral fat volume calculated from the MRI scan             | Litres
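Note that HEIGHT is recorded in centimetres while the BMI formula uses metres, so recomputing BMI is a useful sanity check during data preparation. A minimal sketch (the two example records are hypothetical; real values come from the provided dataset):

```python
import pandas as pd

# Hypothetical stand-in records; in practice these columns
# come from the cohort file supplied by the healthcare team.
df = pd.DataFrame({"HEIGHT": [170.0, 160.0],   # cm
                   "WEIGHT": [70.0, 55.0]})    # kg

# Convert cm -> m before applying Weight (kg) / [Height (m)]^2.
df["BMI_CHECK"] = df["WEIGHT"] / (df["HEIGHT"] / 100.0) ** 2
```

A large discrepancy between BMI_CHECK and the recorded BMI column would flag a data quality issue worth reporting in Task 3.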
The Machine Learning for Classification Coursework Tasks
As a data scientist, you are a logician, a mathematician, a technician, and an analyst, and you need healthcare professionals to understand your analysis. Healthcare professionals are busy individuals and do not have all the time in the world. One essential skill you must adhere to is being straight to the point: focus on the answer needed for each task and provide only as many words as the answer requires. There is no need to provide lengthy descriptions of algorithms and methods unless you are asked to do so. The professionals are only interested in assessing the results, so you must not paste any Python code in this report unless you are specifically asked to. You will be provided with a separate link to submit your code as a Python notebook file (mandatory .ipynb extension). Your data mining tasks are aligned with the phases of the popular CRISP-DM methodology, except for deployment (see Figure 5).
Task (1) – Domain Understanding: Classification or Regression [Total 1 Mark]
Your healthcare team has decided that this problem is a classification problem. In one or two sentences, give a logical reason for proceeding with classification predictive modelling instead of regression predictive modelling to answer any one of the healthcare professionals' questions. [1 Mark]
Task (2) – Data Understanding: Producing Your Experimental Designing [Total 7 Marks]
You are required to note down which gender dataset you want to use for modelling. Choose only one dataset. Note down your selected single healthcare question that you are attempting to answer in this coursework and the type of classification problem you have. [2 Marks]
From your python notebook, show a data frame of the first 10 records of the possible input variables and your class variable. Produce a statistical description and measurement scale type of your selected dataset attributes. Emphasize the distribution of your class variable. (Use screenshots of code outputs only). [5 Marks]
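The outputs this task asks for can be produced along the following lines in the notebook (a minimal pandas sketch; the mini-frame, its values, and the derived class column are hypothetical stand-ins for the provided cohort file, which would normally be loaded with `pd.read_csv`):

```python
import pandas as pd

# Hypothetical mini-cohort standing in for the supplied dataset file.
df = pd.DataFrame({
    "AGE": [34, 51, 47],
    "BMI": [24.2, 31.0, 27.5],
    "Visceral_Fat_Volume": [1.8, 5.6, 3.2],
})
df["RISK"] = ["Low", "High", "Moderate"]   # hypothetical class labels

preview = df.head(10)          # first 10 records (only 3 exist here)
summary = df.describe()        # statistical description of numeric attributes
dist = df["RISK"].value_counts()   # distribution of the class variable
```

`describe()` gives count, mean, std, and quartiles for numeric attributes; `value_counts()` exposes any class imbalance worth emphasising.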
Task (3) – Data Preparation: Cleaning and Transforming your data [Total 30 Marks]
Investigate any issues found in your selected dataset and its variables. Use the table below to organise your findings: [10 Marks]
Dataset or Variable | Name of variable        | Issue description
e.g., Variable      | e.g., Duration of sleep | The issue with this variable is ...
e.g., Variable      | e.g., Job title         | The issue with this variable is ...
e.g., Whole Dataset | Whole dataset           | The issue with this dataset is ...
?                   | ?                       | ?
Based on the issues you found in your data, suggest a suitable solution to mitigate each of them, and provide your justification for using each solution. Again, organise your answer in a table as before. See below: [10 Marks]
Dataset or Variable | Name of variable | The Issue | Solution | Justification
e.g., Variable      |                  |           |          |
e.g., Variable      |                  |           |          |
e.g., Whole Dataset |                  |           |          |
?                   | ?                | ?         | ?        | ?
With the aid of Python packages and a notebook, implement the solutions you suggested in (a) and (b), and show evidence of doing so (use screenshots of code outputs only). Indicate which issue each screenshot resolves, and show code outputs from before and after implementing your solution. [10 Marks]
Task (4) – Modelling: Create Predictive Classification Models [Total 15 Marks]
From the classification algorithms you learnt in the module, four were selected: Naïve Bayes (NB), Decision Trees (DT), K-Nearest Neighbour (KNN), and Artificial Neural Networks (ANN) with a Multi-Layer Perceptron architecture. These algorithms are a mix of parametric and non-parametric algorithms. List the type of each algorithm (parametric vs non-parametric), name any hyperparameters for each algorithm you may want to consider tuning, and give the Python package (source code) used for calling each algorithm. Again, organise your answer in a table as before. See below:
[12 marks]
Algorithm Name | Type of Algorithm | Possible Hyperparameters | Python package source code to call the algorithm
NB             |                   |                          |
DT             |                   |                          |
KNN            |                   |                          |
ANN (MLP)      |                   |                          |
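For reference, the usual scikit-learn entry points for these four algorithms are sketched below (the hyperparameter values shown are illustrative examples of tunable settings, not recommendations):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

models = {
    "NB": GaussianNB(),                                   # few hyperparameters to tune
    "DT": DecisionTreeClassifier(max_depth=5),            # e.g. max_depth, criterion
    "KNN": KNeighborsClassifier(n_neighbors=5),           # e.g. n_neighbors, weights
    "ANN": MLPClassifier(hidden_layer_sizes=(10,),        # e.g. layer sizes,
                         max_iter=500),                   #      activation, alpha
}
```

Each estimator exposes the same `fit`/`predict` interface, which makes it easy to train and evaluate all four in one loop.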
With the aid of Python packages, and using the training–test split approach, build your predictive classification models from the numeric input features and the output feature only. Screenshot the head of the data frame used for training the algorithms. In just a few sentences, justify your choice of training–test split ratio, providing a text reference if necessary. Provide as evidence the code line from your source code that guarantees all models were tested on the same test dataset.
[3 Marks]
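A common way to guarantee that every model is tested on the identical test set is to split once with a fixed `random_state` and reuse the same `X_test`/`y_test` for all models. A sketch with synthetic stand-in data (the 70/30 ratio and seed values are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))           # stand-in numeric input features
y = rng.integers(0, 3, size=100)        # stand-in class labels (3 classes)

# One split, performed once and reused for every model, so the
# test data is guaranteed to be identical across all algorithms.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```

`stratify=y` keeps the class proportions similar in both partitions, which matters when one risk class is rare.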
Task (5) – Evaluation: How good are your models [Total 47 Marks]
Your healthcare professionals provided the following success criteria that must guide you when evaluating your models.
“When evaluating the performance of your models, which address your selected question out of the three, the first priority is identifying the largest number of subjects in the high- and moderate-risk groups, to recommend interventions and potentially save lives. But keep in mind that the model also needs to reasonably identify those in the low-risk group, to minimise any unnecessary initiation of lifestyle and diet changes.”
With the aid of python packages, paste the test confusion matrix for each trained model as screenshots from the output of your python code. [4 marks]
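A hedged sketch of producing a test confusion matrix in the notebook (synthetic data from `make_classification` stands in for the cohort dataset, and DT is just one of the four models):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the cohort data.
X, y = make_classification(n_samples=200, n_classes=3,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)   # rows = actual classes, columns = predicted classes
```

The diagonal counts correct predictions per class; off-diagonal cells show which risk classes are being confused, which is exactly what the success criteria ask you to inspect.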
Five different classification evaluation metrics are noted below. State which of them you think are strongly related to the above success criteria and which are not, and provide a justification for each. With the aid of Python packages, document the score for each built model. [30 marks]
Metric Name | Related or Unrelated | Justification in relation to the success criteria | Model Name | Metric Score
Accuracy    |                      |                                                   | NB         |
            |                      |                                                   | DT         |
            |                      |                                                   | KNN        |
            |                      |                                                   | ANN        |
Recall      |                      |                                                   | NB         |
            |                      |                                                   | DT         |
            |                      |                                                   | KNN        |
            |                      |                                                   | ANN        |
Precision   |                      |                                                   | NB         |
            |                      |                                                   | DT         |
            |                      |                                                   | KNN        |
            |                      |                                                   | ANN        |
F-Measure   |                      |                                                   | NB         |
            |                      |                                                   | DT         |
            |                      |                                                   | KNN        |
            |                      |                                                   | ANN        |
AUC-ROC     |                      |                                                   | NB         |
            |                      |                                                   | DT         |
            |                      |                                                   | KNN        |
            |                      |                                                   | ANN        |
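The five metrics can be computed with scikit-learn along these lines (a sketch with synthetic stand-in data and NB as the example model; for a multi-class problem, `average="macro"` and `multi_class="ovr"` are illustrative choices that should be justified in the report):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_classes=3,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

scores = {
    "Accuracy": accuracy_score(y_test, y_pred),
    # 'macro' averaging weighs each risk class equally, so minority
    # classes (e.g. High Risk) are not drowned out by the majority.
    "Recall": recall_score(y_test, y_pred, average="macro"),
    "Precision": precision_score(y_test, y_pred, average="macro"),
    "F-Measure": f1_score(y_test, y_pred, average="macro"),
    # Multi-class AUC-ROC needs class probabilities (one-vs-rest here).
    "AUC-ROC": roc_auc_score(y_test, model.predict_proba(X_test),
                             multi_class="ovr"),
}
```

The resulting dictionary maps directly onto one model's column of the table above.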
Based on the related performance metrics scores you identified in (b), suggest the best classification model. Briefly describe how this model satisfies the needs of your healthcare professionals. [3 marks]
To enhance your best model’s generalisation performance, you can tune the hyperparameters you indicated for that specific algorithm in (a). With the aid of Python packages, re-train the algorithm with GridSearchCV, indicating the number of cross-validation folds (K) used. For the newly tuned model, document the estimated best hyperparameters, present the test confusion matrix, and calculate and document the new scores for the metrics you identified in (b) as related to the success criteria. Explain your observations on whether the hyperparameter tuning enhanced the generalisation of your original best model.
[6 Marks]
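The tuning step might be sketched as follows (synthetic stand-in data again; KNN and its `n_neighbors` grid are illustrative, since the grid should target the hyperparameters of your own best algorithm):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_classes=3,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Exhaustive search over an illustrative grid, with K = 5 CV folds.
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [3, 5, 7, 9]},
                    cv=5)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_   # re-fitted on the full training set
print(grid.best_params_)            # estimated best hyperparameters
```

Crucially, the grid search only ever sees `X_train`; the untouched `X_test` is then reused to compare the tuned model against the original on the same test data.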
Based on your best model, draft an answer to your selected research question, provide criticism of your best-performing model, and state any limitations you identified. Research and try to explain why your selected algorithm outperformed all the other models, in no more than 200 words. [4 Marks]
Critical note about the structure of your submission:
This coursework is limited to a maximum of 7 pages. The minimum font is Arial size 12, single-spaced, with a minimum of 1-inch page margins. Exceeding the page limit or going below the specified font size will result in an automatic 20% penalty deduction from your report’s mark.
Use the question numbers as headers; you do not need to copy the full question and may summarise a new header from it. It is crucial that your answers map to each question’s number and appear in the correct order. Otherwise, marking your work may be significantly delayed and marks may be missed between the lines.
There is no need to go on a new venture with Python coding! Follow the practice of code reuse. For those who are new to Python, all the code you need is given in your tutorial documents and solution Python notebooks; all you need to do is stitch it together from different tutorials to get the required outputs. However, I won’t stop you from venturing into new Python coding.
Some of the submissions may be invited for a 20-minute viva. So be prepared to explain your findings should you have been invited for one. Failing to attend the viva may impact your mark.