Artificial Intelligence Question 1:
Consider a Naive Bayes classifier for spam filtering. We are given a training set of 500 randomly chosen emails. We examine them and label 200 of them as spam emails and 300 as non-spam emails. The 200 spam emails contain 2000 words in total. 200 spam emails contain the word "a"; 60 contain the word "good"; and 50 contain the word "job". The 300 non-spam emails contain 1000 words in total. 150 non-spam emails contain the word "a"; 30 non-spam emails contain the word "good"; and 10 non-spam emails contain the word "job". We use S to denote the random event that an email is spam, and NS to denote that it is non-spam. We use P(word|S) to denote the conditional probability that a given word appears in a spam email (P(word|NS) is defined similarly). P(word1, word2|S) is the probability that both word1 and word2 appear in a spam email.
i. What is the best approximation to P("a"|S), P("good"|S), and P("job"|S) given the training set?
ii. What is the best approximation to P("a", "good", "job"|S) and P("a", "good", "job"|NS) given the training set? (Hint: use the structure of a Naive Bayes network to answer this question.)
iii. Given the test email "Well done! You did a good job in CS471!", will a Naive Bayes classifier trained on the training set above classify it as spam? Why or why not? (Hint: you should make the decision based on P(S|"a", "good", "job") and P(NS|"a", "good", "job").)
i. Each estimate is the fraction of the 200 spam emails that contain the word:
P(a|S) = 200/200 = 1,
P(good|S) = 60/200 = 0.3,
P(job|S) = 50/200 = 0.25.
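The estimates in part i can be checked with a short script. This is a minimal sketch that assumes the relative-frequency estimate (emails containing the word divided by emails in the class); the names `n_emails`, `emails_with_word`, and `word_likelihood` are illustrative and not part of the original problem.

```python
from fractions import Fraction

# Counts taken from the problem statement: emails per class, and the
# number of emails in each class that contain each word.
n_emails = {"S": 200, "NS": 300}
emails_with_word = {
    "S": {"a": 200, "good": 60, "job": 50},
    "NS": {"a": 150, "good": 30, "job": 10},
}

def word_likelihood(word, label):
    """Relative-frequency estimate of P(word | label) (assumed estimator)."""
    return Fraction(emails_with_word[label][word], n_emails[label])

for label in ("S", "NS"):
    for word in ("a", "good", "job"):
        p = word_likelihood(word, label)
        print(f"P({word}|{label}) = {p} = {float(p):.4f}")
```

Using `Fraction` keeps the ratios exact, so 50/200 prints as 1/4 = 0.2500 rather than a rounded float.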
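ii. Naive Bayes assumes the words are conditionally independent given the class, so the joint probability factors into a product:
P(a,good,job|S) = P(a|S)P(good|S)P(job|S) = (200/200)*(60/200)*(50/200) = 1*0.3*0.25 = 0.075
P(a,good,job|NS) = P(a|NS)P(good|NS)P(job|NS) = (150/300)*(30/300)*(10/300) = 0.001667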
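The factorization in part ii can be sketched as a product over per-word estimates. The dictionary `likelihoods` and the helper `joint_likelihood` are hypothetical names introduced only for this illustration; the numbers come from the counts in the problem statement.

```python
import math

# Per-word estimates: emails containing the word / emails in the class.
likelihoods = {
    "S": {"a": 200 / 200, "good": 60 / 200, "job": 50 / 200},
    "NS": {"a": 150 / 300, "good": 30 / 300, "job": 10 / 300},
}

def joint_likelihood(words, label):
    """P(words | label) under the Naive Bayes independence assumption."""
    return math.prod(likelihoods[label][w] for w in words)

words = ["a", "good", "job"]
p_s = joint_likelihood(words, "S")    # 1 * 0.3 * 0.25
p_ns = joint_likelihood(words, "NS")  # 0.5 * 0.1 * (1/30)
print(f"P(a,good,job|S)  = {p_s:.6f}")
print(f"P(a,good,job|NS) = {p_ns:.6f}")
```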
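iii. Apply Bayes' rule with P(a,good,job|S) = 1*0.3*0.25 = 0.075 and priors P(S) = 200/500, P(NS) = 300/500:
P(S|a,good,job) = (P(a,good,job|S)P(S))/P(a,good,job) = (0.075*(200/500))/P(a,good,job) = 0.03/P(a,good,job)
P(NS|a,good,job) = (P(a,good,job|NS)P(NS))/P(a,good,job) = (0.001667*(300/500))/P(a,good,job) = 0.001/P(a,good,job)
The denominator P(a,good,job) is common to both posteriors, and 0.03 > 0.001, so P(S|a,good,job) > P(NS|a,good,job).
The classifier therefore labels the email as spam.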
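The decision in part iii can also be reproduced numerically. The sketch below goes one step beyond the comparison above by normalizing the two scores, on the assumption that S and NS are the only two classes, so that P(a,good,job) is the sum of the two numerators. The variable names are illustrative.

```python
# Class priors and joint likelihoods from the training-set counts.
prior = {"S": 200 / 500, "NS": 300 / 500}
joint = {
    "S": (200 / 200) * (60 / 200) * (50 / 200),
    "NS": (150 / 300) * (30 / 300) * (10 / 300),
}

# Unnormalized posteriors: P(words | class) * P(class).
score = {label: joint[label] * prior[label] for label in prior}

# Assumption: S and NS are exhaustive, so the evidence P(a,good,job)
# is the sum of the two unnormalized scores.
evidence = sum(score.values())
posterior = {label: score[label] / evidence for label in score}

prediction = max(posterior, key=posterior.get)
print(score)
print(posterior)
print("Predicted class:", prediction)
```

Normalizing is not needed to pick the class, since both scores share the same denominator, but it shows how strongly the three words favor spam under these estimates.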