2023 In this problem we will attempt to identify spam or ham SMS messages using the na ve Bayes | Assignment Collections
Computer Science 2023 Na¨ıve Bayes Classification
2023 In this problem we will attempt to identify spam or ham SMS messages using the na ve Bayes | Assignment Collections
In this problem, we will attempt to identify spam or ham SMS messages using the na¨ıve Bayes model. You will need to download the SMS dataset: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. Python users: scikit-learn is not allowed.
Consider an SMS message as a (case-insensitive) sequence of words (X1,··· ,XT). Ignore all other punctuation. Under the na¨ıve Bayes assumption, the probability of the words in each message factors as:
P(x1:T|y) =
T Y t=1
P(xi|y). (3)
When estimated from dataset D with pseudo-count prior of α, the model parameters are:
ˆ P(xi|y) =
CountD(xi,y) + α CountD(y) + Nα
, (4)
where: CountD(xi,y) and CountD(y) are the number of occurances of word xi in spam/ham messages y (from our sample D); and the number of words for label spam/ham words y (from our sample D) respectively; and N is the total number of dictionary words (including words not seen in D). Let us use N = 20,000 and
4
α = 0.1 in our experiments.
Note that the classes are heavily imbalanced. The number of spam messages is 747, while the number of ham messages is 4827. If a simple classifier predicts that all messages are ham, it will get around 86% accuracy. In this case, accuracy is not a good measurement of the classifier’s performance.
Instead of using accuracy, we can use confusion matrix to see the performance of our model. Below is the explanation of confusion matrix:
True condition Positive Negative
Predicted Condition
Positive True positive False positive Negative False negative True negative
Other important performance measurements are precision, recall, and F-score, defined as:
precision =
true positive true positive + false positive
(5)
recall =
true positive true positive + false negative
(6)
F-score = 2·
precision·recall precision + recall
(7)
(a) Randomly split the messages into a training set D1 (80% of messages) and a testing set D2 (20% of messages). Calculate the testing accuracy, confusion matrix, precision, recall, and F-score of the Na¨ıve Bayes classifier in determining whether a message is spam or ham. Submit your source code. Note: Let’s assume that spam is the positive class. (20 points)
(b) How does the change of α effect the classifier performance? Using random split above, evaluate the training and test accuracy and F-score under different selections of α. The selection of α values are 2i where i = −5,··· ,0. Create two plots, the first plot is for the accuracy measure and the second plot is for F-score. In each plot, x-axis represents i, and y-axis represents the performance measure (accuracy/F-score). Each plot contains two line chart, a line chart describing training accuracy/F-score measure, the other line chart is for test accuracy/F-score. Submit your source code. (20 points)
Hints: There are scripts in both Octave (nbayes.m) and Julia (nbayes.jl) for counting occurances of words from spam/ham messages. A few notes: (i) Octave can make use of Java data structures (e.g., the hashmap); this can be easier / more efficient than using Octave structures. (ii) Julia has its own data structure library which includes many standard data structures e.g. Dict for hashmap. (iii) logxy = logx + logy.
We give our students 100% satisfaction with their assignments, which is one of the most important reasons students prefer us to other helpers. Our professional group and planners have more than ten years of rich experience. The only reason is that we have successfully helped more than 100000 students with their assignments on our inception days. Our expert group has more than 2200 professionals in different topics, and that is not all; we get more than 300 jobs every day more than 90% of the assignment get the conversion for payment.