A PLAN FOR SPAM

A PLAN FOR SPAM SOLUTION

(i) According to the author, what is content-based filtering, and how does it differ from other approaches to filtering?

Answer: Content-based filtering develops rules for what constitutes spam from the content of both spam and non-spam mail. Other approaches try to identify individual features of spam and build their filters to eliminate mail with those features.

(ii) Do you think content-based filtering may be more effective in eliminating spam? Why or why not?

Answer: Yes, content-based filtering may be more effective, since the rules developed through Bayesian analysis automatically pick up new features that spammers introduce to escape detection, such as deliberate misspellings. It may also be more effective because it uses not only the message body but also the header information, including the server name, the route used, etc. To escape a Bayesian filter, spam would have to become statistically indistinguishable from non-spam, at which point it loses its commercial value to the spam firms.

(iii) In connection with spam filters, what are false positives? Would they be worse than false negatives? Why?

Answer: False positives are non-spam mails wrongly identified as spam and filtered out. False negatives are spam mails wrongly identified as non-spam and let through by the filter. False negatives are irritating and annoying to the user, whereas a false positive may contain important and useful information that is wrongly filtered out. False positives are therefore more costly to the user and should be avoided at all costs.
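
To make the two error types concrete, here is a minimal sketch that counts them for a handful of mails. The function name `count_filter_errors` and the "good"/"bad" labels are illustrative choices, not something taken from the slides.

```python
def count_filter_errors(actual, predicted):
    """Return (false_positives, false_negatives) for a spam filter.

    actual and predicted are parallel lists of "good" / "bad" labels.
    """
    pairs = list(zip(actual, predicted))
    # False positive: a good mail that the filter marked as bad (spam).
    false_positives = sum(1 for a, p in pairs if a == "good" and p == "bad")
    # False negative: a bad mail (spam) that the filter let through as good.
    false_negatives = sum(1 for a, p in pairs if a == "bad" and p == "good")
    return false_positives, false_negatives


actual = ["good", "good", "bad", "bad", "bad"]
predicted = ["good", "bad", "bad", "good", "good"]
fp, fn = count_filter_errors(actual, predicted)
print(fp, fn)  # 1 2
```

In this toy run the single false positive is the costly error in the sense of the answer above: a legitimate mail the user never sees.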

(iv) Let us assume that the hash tables mentioned above have the following structure:

             Bad       Good
Word         fbad      fgood
Word 1       fbad 1    fgood 1
Word 2       fbad 2    fgood 2
. . .
Sum          nbad      ngood
Grand Total  ntotal

For a 'word', let fbad represent its frequency in the corpus of spam (i.e. bad) e-mails and fgood its frequency in the corpus of non-spam (i.e. good) e-mails. Let nbad be the total number of e-mails in the corpus of spam e-mails and, similarly, let ngood be the total number of e-mails in the corpus of non-spam e-mails.

Finally, let the total number of e-mails in both corpora together be ntotal, i.e. ntotal = nbad + ngood. It is also given that nbad = ngood. Show that for any word in the hash tables, the probability that an e-mail containing it is a spam is given by: [equation not preserved] and, when nbad = ngood, [equation not preserved]. Define your events carefully.
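
The two hash tables described in the question could be built along the following lines. This is only a sketch under stated assumptions: fbad and fgood are taken to count the number of e-mails in each corpus that contain the word (the slides do not say whether repeated occurrences within one mail count more than once), mails are split on whitespace, and the function name `build_tables` is hypothetical.

```python
from collections import Counter


def build_tables(bad_mails, good_mails):
    """Build the word tables for the spam (bad) and non-spam (good) corpora.

    Assumption: each word is counted at most once per mail, so that
    fbad[word] / nbad is the proportion of spam mails containing the word.
    """
    fbad = Counter()
    for mail in bad_mails:
        fbad.update(set(mail.lower().split()))
    fgood = Counter()
    for mail in good_mails:
        fgood.update(set(mail.lower().split()))
    nbad, ngood = len(bad_mails), len(good_mails)
    return fbad, fgood, nbad, ngood


bad = ["Buy cheap pills now", "cheap pills cheap"]
good = ["meeting notes now", "lunch now"]
fbad, fgood, nbad, ngood = build_tables(bad, good)
ntotal = nbad + ngood
print(fbad["cheap"], fgood["cheap"], nbad, ngood, ntotal)  # 2 0 2 2 4
```

A `Counter` returns 0 for words it has never seen, which matches an empty cell in the table.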

(iv) Answer: Given the structure of the hash tables, [equation not preserved] and [equation not preserved]. Also, [equation not preserved] and [equation not preserved].

For the above, we have used the following events:
bad: A mail picked at random is a spam
good: A mail picked at random is a non-spam
word: A mail picked at random will contain the “word”
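
The equations for this answer are not preserved in the transcript. One way the derivation might run, using the events bad, good and word as defined above, is sketched below. It assumes that fbad and fgood count the e-mails in each corpus that contain the word, so that fbad/nbad and fgood/ngood are proportions, and it writes P(A | B) for what the slides write as Probability (A / B).

```latex
\begin{align*}
P(\text{word} \mid \text{bad}) &= \frac{f_{\text{bad}}}{n_{\text{bad}}}, &
P(\text{word} \mid \text{good}) &= \frac{f_{\text{good}}}{n_{\text{good}}}, \\
P(\text{bad}) &= \frac{n_{\text{bad}}}{n_{\text{total}}}, &
P(\text{good}) &= \frac{n_{\text{good}}}{n_{\text{total}}}.
\end{align*}

% Bayes' theorem applied to the event "bad" given "word"
\begin{align*}
P(\text{bad} \mid \text{word})
  &= \frac{P(\text{word} \mid \text{bad})\, P(\text{bad})}
          {P(\text{word} \mid \text{bad})\, P(\text{bad})
           + P(\text{word} \mid \text{good})\, P(\text{good})}.
\end{align*}

% With n_bad = n_good we have P(bad) = P(good) = 1/2, which cancels
\begin{align*}
P(\text{bad} \mid \text{word})
  &= \frac{f_{\text{bad}} / n_{\text{bad}}}
          {f_{\text{bad}} / n_{\text{bad}} + f_{\text{good}} / n_{\text{good}}}
   = \frac{f_{\text{bad}}}{f_{\text{bad}} + f_{\text{good}}}.
\end{align*}
```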

(v) According to the author, “Based on my corpus ...of being a spam.” Can you justify this statement?
Hint: For ease of understanding, define Probability (bad / word 1) = p1 and Probability (bad / word 2) = p2. Then, [equation not preserved]

(v) Answer: Let us define two more events:
word 1: A mail picked at random will contain “word 1”
word 2: A mail picked at random will contain “word 2”
Since word 1 and word 2 are mutually independent events, [equation not preserved] and [equation not preserved]. Also, [equation not preserved] and [equation not preserved].

So, using Bayes' theorem, when nbad = ngood, [equation not preserved]

This can be simplified by using the previous results, i.e. [equation not preserved] and [equation not preserved]. We can rewrite [equation not preserved], where p1 = 0.97 and p2 = 0.99, as given in the text.
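
The equations for this answer are likewise missing from the transcript. A possible reconstruction is sketched below. It assumes that word 1 and word 2 are independent within each corpus (that is, conditionally on bad and on good, which is slightly stronger than what the slide states), that nbad = ngood so that P(bad) = P(good) = 1/2, and it writes w1, w2 for the events word 1, word 2 and P(A | B) for Probability (A / B).

```latex
% Bayes' theorem with both words present; the equal priors cancel
\begin{align*}
P(\text{bad} \mid w_1 \cap w_2)
  &= \frac{P(w_1 \mid \text{bad})\, P(w_2 \mid \text{bad})}
          {P(w_1 \mid \text{bad})\, P(w_2 \mid \text{bad})
           + P(w_1 \mid \text{good})\, P(w_2 \mid \text{good})}.
\end{align*}

% Single-word result with equal priors, for i = 1, 2
\begin{align*}
p_i = \frac{P(w_i \mid \text{bad})}{P(w_i \mid \text{bad}) + P(w_i \mid \text{good})}
\quad\Longrightarrow\quad
\frac{P(w_i \mid \text{good})}{P(w_i \mid \text{bad})} = \frac{1 - p_i}{p_i}.
\end{align*}

% Divide numerator and denominator by P(w_1 | bad) P(w_2 | bad)
\begin{align*}
P(\text{bad} \mid w_1 \cap w_2)
  &= \frac{1}{1 + \dfrac{(1 - p_1)(1 - p_2)}{p_1 p_2}}
   = \frac{p_1 p_2}{p_1 p_2 + (1 - p_1)(1 - p_2)} \\
  &= \frac{0.97 \times 0.99}{0.97 \times 0.99 + 0.03 \times 0.01}
   = \frac{0.9603}{0.9606} \approx 0.9997.
\end{align*}
```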

(vi) What happens to this probability when a third word, say word 3, is added?

(vi) Answer: If [equation not preserved], then P (bad / (all three words)) [equation not preserved]. Now, through induction, this can be expanded further.
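
The extension to three or more words can be sketched numerically. The code below assumes the combined probability has the product form p1 p2 ... pn / (p1 p2 ... pn + (1 - p1)(1 - p2) ... (1 - pn)), which follows from Bayes' theorem if the words are independent within each corpus and nbad = ngood. The function name `combine` and the third probability 0.90 are illustrative, not taken from the slides.

```python
from math import prod


def combine(probs):
    """Combine per-word spam probabilities into one probability.

    Assumes every value lies strictly between 0 and 1, so the
    denominator cannot be zero.
    """
    spam = prod(probs)                 # p1 * p2 * ... * pn
    ham = prod(1 - p for p in probs)   # (1 - p1) * (1 - p2) * ... * (1 - pn)
    return spam / (spam + ham)


two_words = combine([0.97, 0.99])          # p1 and p2 from the text
three_words = combine([0.97, 0.99, 0.90])  # 0.90 is a made-up p3
print(round(two_words, 4))  # 0.9997

# Induction step: folding in word 3 after combining words 1 and 2
# gives the same result as combining all three at once.
stepwise = combine([two_words, 0.90])
print(abs(stepwise - three_words) < 1e-12)  # True
```

A third word with p3 above 0.5 pushes the combined probability further towards 1, one below 0.5 pulls it down, and p3 = 0.5 leaves it unchanged.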
